
Simba ODBC Driver with SQL Connector

for Apache Spark

Installation and Configuration Guide

November 21, 2014

Simba Technologies Inc.


Copyright © 2012-2014 Simba Technologies Inc. All Rights Reserved.

Information in this document is subject to change without notice. Companies, names and data used in examples herein are fictitious unless otherwise noted. No part of this publication, or the software it describes, may be reproduced, transmitted, transcribed, stored in a retrieval system, decompiled, disassembled, reverse-engineered, or translated into any language in any form by any means for any purpose without the express written permission of Simba Technologies Inc.

Trademarks

Simba, the Simba logo, SimbaEngine, SimbaEngine C/S, SimbaExpress and SimbaLib are registered trademarks of Simba Technologies Inc. All other trademarks and/or servicemarks are the property of their respective owners.

Cyrus SASL

Copyright (c) 1998-2003 Carnegie Mellon University. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. The name "Carnegie Mellon University" must not be used to endorse or promote products derived from this software without prior written permission. For permission or any other legal details, please contact

Office of Technology Transfer
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3890
(412) 268-4387, fax: (412) 268-7395
[email protected]

4. Redistributions of any form whatsoever must retain the following acknowledgment:

"This product includes software developed by Computing Services at Carnegie Mellon University (http://www.cmu.edu/computing/)."

CARNEGIE MELLON UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.


ICU License - ICU 1.8.1 and later

COPYRIGHT AND PERMISSION NOTICE

Copyright (c) 1995-2010 International Business Machines Corporation and others. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.

All trademarks and registered trademarks mentioned herein are the property of their respective owners.

OpenSSL

Copyright (c) 1998-2008 The OpenSSL Project. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. All advertising materials mentioning features or use of this software must display the following acknowledgment:

"This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. (http://www.openssl.org/)"


4. The names "OpenSSL Toolkit" and "OpenSSL Project" must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact [email protected].

5. Products derived from this software may not be called "OpenSSL" nor may "OpenSSL" appear in their names without prior written permission of the OpenSSL Project.

6. Redistributions of any form whatsoever must retain the following acknowledgment:

"This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (http://www.openssl.org/)"

THIS SOFTWARE IS PROVIDED BY THE OpenSSL PROJECT ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE OpenSSL PROJECT OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Apache Spark

Copyright 2012-2014 The Apache Software Foundation.

Apache Thrift

Copyright 2006-2010 The Apache Software Foundation.

Expat

Copyright (c) 1998, 1999, 2000 Thai Open Source Software Center Ltd

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


libcurl

COPYRIGHT AND PERMISSION NOTICE

Copyright (c) 1996 - 2012, Daniel Stenberg, <[email protected]>.

All rights reserved.

Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.

Contact Us

Simba Technologies Inc.
938 West 8th Avenue
Vancouver, BC
Canada V5Z 1E5

www.simba.com

Telephone: +1 (604) 633-0008
Information and Product Sales: Extension 2
Technical Support: Extension 3

Fax: +1 (604) 633-0004

Information and Product Sales: [email protected]
Technical Support: [email protected]
Follow us on Twitter: @SimbaTech

Printed in Canada


Table of Contents

Introduction ....................................................................... 7

Windows Driver ..................................................................... 7
    System Requirements ............................................................ 7
    Installing the Driver .......................................................... 8
    Configuring ODBC Connections ................................................... 8
    Configuring Authentication .................................................... 11

Linux Driver ...................................................................... 12
    System Requirements ........................................................... 12
    Installation Using the RPM .................................................... 12
    Installation Using the Tarball Package ........................................ 13
    Setting the LD_LIBRARY_PATH Environment Variable .............................. 14

Configuring ODBC Connections for Linux ............................................ 14
    Files .......................................................................... 14
    Sample Files ................................................................... 15
    Configuring the Environment ................................................... 15
    Configuring the odbc.ini File ................................................. 16
    Configuring the odbcinst.ini File ............................................. 16
    Configuring the simba.sparkodbc.ini File ...................................... 17
    Configuring Authentication .................................................... 17

Features .......................................................................... 18
    SQL Query versus Spark SQL Query .............................................. 18
    SQL Connector .................................................................. 19
    Data Types ..................................................................... 19
    Catalog and Schema Support .................................................... 19
    Spark System Table ............................................................ 19
    Server-side Properties ........................................................ 20
    Get Tables With Query ......................................................... 20

Known Issues in Spark ............................................................. 20
    Backquotes in aliases are not handled correctly ............................... 20
    Filtering on TIMESTAMP columns does not return any rows ....................... 20
    Cannot use AND to combine a TIMESTAMP column filter with another filter ...... 21

Contact Us ........................................................................ 21

Appendix A: Authentication Options ................................................ 22
    Using No Authentication ....................................................... 22
    Using Kerberos ................................................................. 23
    Using User Name ................................................................ 23
    Using User Name and Password .................................................. 23

Appendix B: Configuring Kerberos Authentication for Windows ....................... 24
    MIT Kerberos ................................................................... 24

Appendix C: Driver Configuration Options .......................................... 27


Introduction

Welcome to the Simba ODBC Driver with SQL Connector for Apache Spark. ODBC is one of the most established and widely supported APIs for connecting to and working with databases. At the heart of the technology is the ODBC driver, which connects an application to the database.

Simba ODBC Driver with SQL Connector for Apache Spark is used for direct SQL and Spark SQL access to Apache Hadoop / Spark distributions, enabling Business Intelligence (BI), analytics and reporting on Hadoop-based data.

The driver efficiently transforms an application's SQL query into the equivalent form in Spark SQL, which is a subset of SQL-92. If an application is Spark-aware, then the driver is configurable to pass the query through. The driver interrogates Spark to obtain schema information to present to a SQL-based application. Queries, including joins, are translated from SQL to Spark SQL. For more information about the differences between Spark SQL and SQL, refer to the section "Features" on page 18.

Simba ODBC Driver with SQL Connector for Apache Spark complies with the ODBC 3.52 data standard and adds important functionality such as Unicode and 32- and 64-bit support for high-performance computing environments.

This guide is suitable for users who are looking to access data residing within Hadoop from their desktop environment. Application developers may also find the information helpful. Please refer to your application for details on connecting via ODBC.

Windows Driver

System Requirements

You install Simba ODBC Driver with SQL Connector for Apache Spark on client computers accessing data in a Hadoop cluster with the Spark service installed and running. Each computer where you install the driver must meet the following minimum system requirements:

• One of the following operating systems (32- and 64-bit editions are supported):

o Windows® Vista

o Windows® 7 Professional

o Windows® 8

o Windows® 8.1

o Windows® Server 2008 R2

• 25 MB of available disk space

Important: To install the driver, you need Administrator privileges on the computer.

The driver is suitable for use with all versions of Apache Spark.


Installing the Driver

On 64-bit Windows operating systems, you can execute 32- and 64-bit applications transparently. You must use the version of the driver matching the bitness of the client application accessing data in Hadoop / Spark:

• SimbaSparkODBC32.msi for 32-bit applications

• SimbaSparkODBC64.msi for 64-bit applications

You can install both versions of the driver on the same computer.

Note: For an explanation of how to use ODBC on 64-bit editions of Windows, see http://www.simba.com/wp-content/uploads/2010/10/HOW-TO-32-bit-vs-64-bit-ODBC-Data-Source-Administrator.pdf

To install Simba ODBC Driver with SQL Connector for Apache Spark:

1. Depending on the bitness of your client application, double-click to run SimbaSparkODBC32.msi or SimbaSparkODBC64.msi

2. Click Next

3. If you agree to the terms of the License Agreement, select the check box to accept them, and then click Next

4. To change the installation location, click the Change button, then browse to the desired folder, and then click OK. To accept the installation location, click Next

5. Click Install

6. When the installation completes, click Finish

7. If you received a license file via e-mail, then copy the license file into the \lib subfolder in the installation folder you selected in step 4.

Configuring ODBC Connections

To create a Data Source Name (DSN):

1. Click the Start button .

2. Click All Programs.

3. Click the Simba Spark ODBC Driver 1.0 (64-bit) or the Simba Spark ODBC Driver 1.0 (32-bit) program group. If you installed both versions of the driver, you will see two program groups.

Because DSNs are bit-specific, select the version that matches the bitness of your application. For example, a DSN that is defined for the 32-bit driver will only be accessible from 32-bit applications.

4. Click 64-bit ODBC Administrator or 32-bit ODBC Administrator. The ODBC Data Source Administrator window opens.

5. Click the Drivers tab and verify that the Simba Spark ODBC Driver appears in the list of ODBC drivers that are installed on your system.


6. Click the System DSN tab to create a system DSN or click the User DSN tab to create a user DSN.

Note: A system DSN can be seen by all users who log in to a workstation. A user DSN is specific to a user on the workstation; it can only be seen by the user who creates it.

7. Click Add. The Create New Data Source window opens.

8. Select Simba Spark ODBC Driver and then click Finish. The Simba Spark ODBC Driver DSN Setup window opens.

9. In the Data Source Name field, type a name for your DSN.

10. Optionally, in the Description field, type relevant details related to the DSN.

11. In the Host field, type the IP address or hostname of the Spark server.

12. In the Port field, type the listening port for the service.

13. In the Database field, type the name of the database schema to use when a schema is not explicitly specified in a query.

Note: Queries on other schemas can still be issued by explicitly specifying the schema in the query. To determine the appropriate database schema to use, type the show databases command at the Spark command prompt to inspect your databases.

14. In the Spark Server Type list, select the appropriate server type for the version of Spark that you are running:

• If you are running Shark 0.8.1 and earlier, then select SharkServer

• If you are running Shark 0.9.*, then select SharkServer2

• If you are running Spark 1.1 and later, then select SparkThriftServer

15. In the Authentication area, configure authentication as needed. For detailed instructions, refer to the section "Configuring Authentication" on page 11.

16. Optionally, click Advanced Options. In the Advanced Options window:

a. Select the Use Native Query check box to disable the SQL Connector feature.

Note: The SQL Connector feature has been added to the driver to apply transformations to the queries emitted by an application, converting them into an equivalent form in Spark SQL. If the application is Spark-aware and already emits Spark SQL, then turning off the SQL Connector feature avoids the extra overhead of query transformation.

b. Select the Fast SQLPrepare check box to defer query execution to SQLExecute.

Note: When using Native Query mode, the driver will execute the Spark SQL query to retrieve the result set metadata for SQLPrepare. As a result, SQLPrepare might be slow. If the result set metadata is not required after calling SQLPrepare, then enable this option.

c. Select the Get Tables With Query check box to retrieve the names of tables in a particular database using the SHOW TABLES query.

Note: This setting is only applicable when connecting to Shark Server 2 or Spark Thrift Server.


d. In the Rows fetched per block field, type the number of rows to be fetched per block.

Note: Any positive 32-bit integer is a valid value, but testing has shown that performance gains are marginal beyond the default value of 10000 rows.

e. In the Default string column length field, type the maximum data length for string columns.

Note: Spark does not provide the maximum data length for String columns in the columns metadata. This option allows you to tune the maximum data length for String columns.

f. In the Binary column length field, type the maximum data length for binary columns.

Note: Spark does not provide the maximum data length for Binary columns in the columns metadata. This option allows you to tune the maximum data length for Binary columns.

g. In the Decimal column scale field, type the maximum number of digits to the right of the decimal point for numeric data types.

h. To create a server-side property, click the Add button, then type appropriate values in the Key and Value fields, and then click OK

OR

To edit a server-side property, select the property to edit in the Server Side Properties area, then click the Edit button, then update the Key and Value fields as needed, and then click OK

OR

To delete a server-side property, select the property to remove in the Server Side Properties area, and then click the Remove button. In the confirmation dialog, click Yes

Note: For a list of all Hadoop and Spark server-side properties that your implementation supports, type set -v at the Spark CLI command line. You can also execute the set -v query after connecting using the driver.

i. If you selected SharkServer2 or SparkThriftServer as the Spark server type, then select or clear the Apply server side properties with queries check box as needed.

Note: If you selected SharkServer2 or SparkThriftServer, then the Apply server side properties with queries check box is selected by default. Selecting the check box configures the driver to apply each server-side property you set by executing a query when opening a session to the server. Clearing the check box configures the driver to use a more efficient method to apply server-side properties that does not involve additional network round-tripping. Some Shark Server 2 and Spark Thrift Server builds are not compatible with the more efficient method. If the server-side properties you set do not take effect when the check box is clear, then select the check box. If you selected SharkServer as the Spark server type, then the Apply server side properties with queries check box is selected and unavailable.


j. Select the Convert SSP Key Name to Lower Case check box to force the driver to convert server-side property key names to all lower-case characters.

k. Click OK

17. Click Test to test the connection and then click OK
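The settings collected in the DSN Setup dialog above (host, port, database schema, server type) are the same settings a Spark-aware application would supply in a DSN-less ODBC connection string. The following sketch assembles such a string in Python. The key names used here (Driver, Host, Port, Schema, SparkServerType) and the numeric server-type codes are assumptions based on common Simba driver conventions, not confirmed by this guide; check the driver documentation for the exact spellings.

```python
# Sketch: building a DSN-less ODBC connection string from the DSN fields.
# Key names and server-type codes are assumptions for illustration only.

def build_connection_string(host, port, schema="default", server_type=3):
    # server_type: 1 = SharkServer, 2 = SharkServer2, 3 = SparkThriftServer
    # (hypothetical numeric codes; verify against the driver documentation)
    parts = {
        "Driver": "Simba Spark ODBC Driver",
        "Host": host,
        "Port": str(port),
        "Schema": schema,
        "SparkServerType": str(server_type),
    }
    # ODBC connection strings are semicolon-separated Key=Value pairs
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn_str = build_connection_string("spark-master.example.com", 10000)
print(conn_str)
```

An application would pass a string like this to SQLDriverConnect (or a wrapper such as pyodbc) instead of referencing a saved DSN.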

Configuring Authentication

For details on selecting the appropriate authentication for a DSN using Simba ODBC Driver with SQL Connector for Apache Spark, see "Appendix A: Authentication Options" on page 22.

Using No Authentication

To connect to a Spark server without authenticating the connection:

Click the drop-down arrow next to the Mechanism field, and then select No Authentication

Note: When connecting to a Spark server of type Shark Server, use No Authentication.

Using Kerberos

To use Kerberos authentication, Kerberos must be configured prior to use. See “Appendix B: Configuring Kerberos Authentication for Windows” on page 24 for details.

After Kerberos has been installed and configured, then set the following options in the Authentication area in the Simba Spark ODBC Driver DSN Setup dialog:

1. Click the drop-down arrow next to the Mechanism field, and then select Kerberos

2. If no default realm is configured for your Kerberos implementation or the realm of your Shark Server 2 or Spark Thrift Server is not the default, then type the value for the Kerberos realm of the Shark Server 2 or Spark Thrift Server host in the Realm field. To use the default realm, leave the Realm field empty.

3. In the Host FQDN field, type the value for the fully qualified domain name of the Shark Server 2 or Spark Thrift Server host.

4. In the Service Name field, type the value for the service name of the server.

Using User Name

Authenticating by user name does not use a password. The user name labels the session, facilitating database tracking.

To configure your DSN for user name authentication:

1. In the Simba Spark ODBC Driver DSN Setup dialog, click the drop-down arrow next to the Mechanism field, and then select User Name

2. In the User Name field, type an appropriate credential.


Using User Name and Password

To configure your DSN for user name and password authentication:

1. In the Simba Spark ODBC Driver DSN Setup dialog, click the drop-down arrow next to the Mechanism field, and then select User Name and Password

2. In the User Name field, type an appropriate credential.

3. In the Password field, type the password corresponding to the user name you typed in step 2.
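In a DSN-less connection string, each of the four mechanisms described above maps to an authentication fragment. The sketch below illustrates this mapping; the AuthMech key and its numeric codes (0 = No Authentication, 1 = Kerberos, 2 = User Name, 3 = User Name and Password) are assumptions based on common Simba driver conventions and should be verified against the driver documentation.

```python
# Sketch: building the authentication portion of a connection string.
# The AuthMech key and its codes are assumed conventions, not confirmed
# by this guide.

AUTH_MECHANISMS = {
    "No Authentication": 0,
    "Kerberos": 1,
    "User Name": 2,
    "User Name and Password": 3,
}

def auth_fragment(mechanism, uid=None, pwd=None):
    # Only User Name and User Name and Password mechanisms use UID/PWD
    parts = [f"AuthMech={AUTH_MECHANISMS[mechanism]}"]
    if uid is not None:
        parts.append(f"UID={uid}")
    if pwd is not None:
        parts.append(f"PWD={pwd}")
    return ";".join(parts)

fragment = auth_fragment("User Name and Password", uid="spark_user", pwd="secret")
print(fragment)
```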

Linux Driver

System Requirements

• One of the following distributions (32- and 64-bit editions are supported):

o Red Hat® Enterprise Linux® (RHEL) 5.0

o CentOS 5.0

o SUSE Linux Enterprise Server (SLES) 11

• 45 MB of available disk space.

• One of the following ODBC driver managers installed:

o iODBC 3.52.7 or above

o unixODBC 2.3.0 or above

Simba ODBC Driver with SQL Connector for Apache Spark requires a Hadoop cluster with the Spark service installed and running.

Simba ODBC Driver with SQL Connector for Apache Spark is suitable for use with all versions of Spark.

Installation Using the RPM

You can install the driver using RPMs. There are two versions of the driver for Linux:

• SimbaSparkODBC-Version-Release.i386.rpm for 32-bit

• SimbaSparkODBC-Version-Release.x86_64.rpm for 64-bit

The version of the driver that you select should match the bitness of the client application accessing your Hadoop-based data. For example, if the client application is 64-bit, then you should install the 64-bit driver. Note that 64-bit editions of Linux support both 32- and 64-bit applications. Verify the bitness of your intended application and install the appropriate version of the driver.
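As a quick way to verify bitness, a client application can report the pointer size of its own process. For example, a Python-based client can check itself as follows (this is a generic check, not part of the driver):

```python
# Report the bitness of the running process. A 32-bit process needs the
# .i386.rpm driver; a 64-bit process needs the .x86_64.rpm driver, even
# on a 64-bit edition of Linux.
import struct

bits = struct.calcsize("P") * 8  # size of a pointer, in bits
print(f"This process is {bits}-bit")
```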

Important: Ensure that you install the driver using the RPM corresponding to your Linux distribution.


Simba ODBC Driver with SQL Connector for Apache Spark driver files are installed in the following directories:

• /opt/simba/sparkodbc/ErrorMessages – Error message files directory

• /opt/simba/sparkodbc/Setup – Sample configuration files directory

• /opt/simba/sparkodbc/lib/32 – 32-bit shared libraries directory

• /opt/simba/sparkodbc/lib/64 – 64-bit shared libraries directory

To install Simba ODBC Driver with SQL Connector for Apache Spark:

1. In Red Hat Enterprise Linux 5.0 or CentOS 5.0, log in as the root user, navigate to the folder containing the driver RPM packages, and then type the following at the command line, where RPMFileName is the file name of the RPM package containing the version of the driver that you want to install:

yum --nogpgcheck localinstall RPMFileName

OR

In SUSE Linux Enterprise Server 11, log in as the root user, navigate to the folder containing the driver RPM packages, and then type the following at the command line, where RPMFileName is the file name of the RPM package containing the version of the driver that you want to install:

zypper install RPMFileName

2. If you received a license file via e-mail, then copy the license file into the /opt/simba/sparkodbc/lib/32 or /opt/simba/sparkodbc/lib/64 folder, depending on the version of the driver that you installed.

Simba ODBC Driver with SQL Connector for Apache Spark depends on the following resources:

• cyrus-sasl-2.1.22-7 or above

• cyrus-sasl-gssapi-2.1.22-7 or above

• cyrus-sasl-plain-2.1.22-7 or above

If the package manager in your Linux distribution cannot resolve the dependencies automatically when installing the driver, then download and manually install the packages required by the version of the driver that you want to install.

Installation Using the Tarball Package

Alternatively, the Simba ODBC Driver with SQL Connector for Apache Spark is available for installation using a TAR.GZ tarball package. The tarball package includes:

• [INSTALL_DIR]/simba/sparkodbc/ contains release notes, the Simba ODBC Driver with SQL Connector for Apache Spark Installation and Configuration Guide in PDF format and a Readme.txt file that provides plain text installation and configuration instructions.

• [INSTALL_DIR]/simba/sparkodbc/lib/32 contains the 32-bit Simba Spark ODBC Driver for Linux and simba.sparkodbc.ini.


• [INSTALL_DIR]/simba/sparkodbc/lib/64 contains the 64-bit Simba Spark ODBC Driver for Linux and simba.sparkodbc.ini.

• [INSTALL_DIR]/simba/sparkodbc/ErrorMessages contains error message files required by the Simba Spark ODBC Driver.

• [INSTALL_DIR]/simba/sparkodbc/Setup contains configuration files named odbc.ini and odbcinst.ini.

Setting the LD_LIBRARY_PATH Environment Variable

The LD_LIBRARY_PATH environment variable must include the paths to:

• Installed ODBC driver manager libraries

• Installed Simba ODBC Driver with SQL Connector for Apache Spark shared libraries

Important: While you can have both 32- and 64-bit versions of the driver installed at the same time on the same computer, do not include the paths to both 32- and 64-bit shared libraries in LD_LIBRARY_PATH at the same time. Include only the path to the shared libraries for the driver that matches the bitness of the client application.

For example, if you are using a 64-bit client application and ODBC driver manager libraries are installed in /usr/local/lib, then set LD_LIBRARY_PATH as follows, where InstallDir is /opt if you installed the driver using the RPM or your installation directory if you installed the driver using the tarball:

export LD_LIBRARY_PATH=/usr/local/lib:InstallDir/simba/sparkodbc/lib/64

Refer to your Linux shell documentation for details on how to set environment variables permanently.
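As a sketch, the export above can also be written to preserve any existing LD_LIBRARY_PATH value; the /opt location assumes an RPM install:

```shell
# Prepend the driver manager path and the 64-bit driver directory, keeping
# any pre-existing LD_LIBRARY_PATH entries (RPM install location assumed).
DRIVER_LIB=/opt/simba/sparkodbc/lib/64
export LD_LIBRARY_PATH=/usr/local/lib:${DRIVER_LIB}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```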

Configuring ODBC Connections for Linux

Files

ODBC driver managers use configuration files to define and configure ODBC data sources and drivers. By default, the following configuration files residing in the user’s home directory are used:

• .odbc.ini – The file used to define ODBC data sources (required)

• .odbcinst.ini – The file used to define ODBC drivers (optional)

By default, the driver is configured using the simba.sparkodbc.ini file (required), which is located in /opt/simba/sparkodbc/lib/32 for the 32-bit Linux driver, /opt/simba/sparkodbc/lib/64 for the 64-bit Linux driver, or /opt/simba/sparkodbc/lib/universal for the Mac OS X driver.


Sample Files

The driver installation contains the following sample configuration files in the Setup directory:

• odbc.ini

• odbcinst.ini

The names of the sample configuration files do not begin with a period (.) so that they will appear in directory listings by default. A filename beginning with a period (.) is hidden. For odbc.ini and odbcinst.ini, if the default location is used, then the filenames must begin with a period (.).

If the configuration files do not already exist in the user’s home directory, then the sample configuration files can be copied to that directory and renamed. If the configuration files already exist in the user’s home directory, then the sample configuration files should be used as a guide for modifying the existing configuration files.

Configuring the Environment

By default, the ODBC configuration files (odbc.ini and odbcinst.ini) reside in the user's home directory. However, two environment variables, ODBCINI and ODBCSYSINI, can be used to specify different locations for the odbc.ini and odbcinst.ini configuration files. Set ODBCINI to point to your odbc.ini file. Set ODBCSYSINI to point to the directory containing the odbcinst.ini file.

The driver’s configuration file (simba.sparkodbc.ini) is installed to the driver’s lib directory and used by the driver by default. The SIMBASPARKINI environment variable can be used to specify a different location for the simba.sparkodbc.ini file.

For example, if your odbc.ini and simba.sparkodbc.ini files are located in /etc and your odbcinst.ini file is located in /usr/local/odbc, then set the environment variables as follows:

export ODBCINI=/etc/odbc.ini
export ODBCSYSINI=/usr/local/odbc
export SIMBASPARKINI=/etc/simba.sparkodbc.ini

The following search order is used to locate the simba.sparkodbc.ini file:

1. If the SIMBASPARKINI environment variable is defined, then the driver searches for the file specified by the environment variable.

Important: SIMBASPARKINI must contain the full path, including the file name.

2. The directory containing the driver's binary is searched for a file named simba.sparkodbc.ini (not beginning with a period).

3. The current working directory of the application is searched for a file named simba.sparkodbc.ini (not beginning with a period).

4. The directory ~/ (that is, $HOME) is searched for a hidden file named .simba.sparkodbc.ini.

5. The directory /etc is searched for a file named simba.sparkodbc.ini (not beginning with a period).
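The five-step search order above can be sketched as a small shell function; the function name is ours, while the candidate paths and the SIMBASPARKINI variable come from the guide:

```shell
# Echo the first simba.sparkodbc.ini candidate that exists, following the
# documented search order: SIMBASPARKINI, driver binary directory, current
# working directory, hidden file in $HOME, then /etc.
find_spark_ini() {
  driver_dir="$1"   # directory containing the driver's binary
  for candidate in \
      "${SIMBASPARKINI:-}" \
      "$driver_dir/simba.sparkodbc.ini" \
      "./simba.sparkodbc.ini" \
      "$HOME/.simba.sparkodbc.ini" \
      "/etc/simba.sparkodbc.ini"; do
    if [ -n "$candidate" ] && [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}
```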


Configuring the odbc.ini File

ODBC Data Sources are defined in the odbc.ini configuration file. The file is divided into several sections:

• [ODBC] is optional and used to control global ODBC configuration, such as ODBC tracing.

• [ODBC Data Sources] is required, listing DSNs and associating DSNs with a driver.

• A section having the same name as the data source specified in the [ODBC Data Sources] section is required to configure the data source.

Here is an example odbc.ini configuration file for Linux:

[ODBC Data Sources]
Sample Simba Spark DSN 32=Simba Spark ODBC Driver 32-bit

[Sample Simba Spark DSN 32]
Driver=/opt/simba/sparkodbc/lib/32/libsimbasparkodbc32.so
HOST=MySparkServer
PORT=10000

To create a data source:

1. Open the .odbc.ini configuration file in a text editor.

2. Add a new entry to the [ODBC Data Sources] section. Type the data source name (DSN) and the driver name.

3. To set configuration options, add a new section having a name matching the data source name (DSN) you specified in step 2. Specify configuration options as key-value pairs.

4. Save the .odbc.ini configuration file.
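The four steps above can be sketched with a single here-document. For illustration this writes to a temporary file rather than the real ~/.odbc.ini, and MySparkServer is a placeholder host name:

```shell
# Create a minimal data source definition (steps 1-4 above).
# In practice the target file would be ~/.odbc.ini.
ODBC_INI=$(mktemp)
cat > "$ODBC_INI" <<'EOF'
[ODBC Data Sources]
Sample Simba Spark DSN 32=Simba Spark ODBC Driver 32-bit

[Sample Simba Spark DSN 32]
Driver=/opt/simba/sparkodbc/lib/32/libsimbasparkodbc32.so
HOST=MySparkServer
PORT=10000
EOF
```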

For details on configuration options available to control the behavior of DSNs using Simba ODBC Driver with SQL Connector for Apache Spark, see “Appendix C: Driver Configuration Options” on page 27.

Configuring the odbcinst.ini File

ODBC Drivers are defined in the odbcinst.ini configuration file. The configuration file is optional because drivers can be specified directly in the odbc.ini configuration file, as described in “Configuring the odbc.ini File” on page 16.

The odbcinst.ini file is divided into the following sections:

• [ODBC Drivers] lists the names of all the installed ODBC drivers.

• A section having the same name as the driver name specified in the [ODBC Drivers] section lists driver attributes and values.

Here is an example odbcinst.ini file for Linux:

[ODBC Drivers]
Simba Spark ODBC Driver 32-bit=Installed
Simba Spark ODBC Driver 64-bit=Installed

[Simba Spark ODBC Driver 32-bit]
Description=Simba Spark ODBC Driver (32-bit)
Driver=/opt/simba/sparkodbc/lib/32/libsimbasparkodbc32.so

[Simba Spark ODBC Driver 64-bit]
Description=Simba Spark ODBC Driver (64-bit)
Driver=/opt/simba/sparkodbc/lib/64/libsimbasparkodbc64.so

To define a driver:

1. Open the .odbcinst.ini configuration file in a text editor.

2. Add a new entry to the [ODBC Drivers] section. Type the driver name, and then type =Installed

Note: Assign the driver name as the value of the Driver attribute in the data source definition instead of the driver shared library name.

3. In .odbcinst.ini, add a new section having a name matching the driver name you typed in step 2, and then add configuration options to the section based on the sample odbcinst.ini file provided with Simba ODBC Driver with SQL Connector for Apache Spark in the Setup directory. Specify configuration options as key-value pairs.

4. Save the .odbcinst.ini configuration file.
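Similarly, steps 1-4 for .odbcinst.ini can be sketched with a here-document (a temporary file stands in for the real ~/.odbcinst.ini):

```shell
# Register the 64-bit driver under a driver name (steps 1-4 above).
ODBCINST_INI=$(mktemp)
cat > "$ODBCINST_INI" <<'EOF'
[ODBC Drivers]
Simba Spark ODBC Driver 64-bit=Installed

[Simba Spark ODBC Driver 64-bit]
Description=Simba Spark ODBC Driver (64-bit)
Driver=/opt/simba/sparkodbc/lib/64/libsimbasparkodbc64.so
EOF
```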

Configuring the simba.sparkodbc.ini File

To configure Simba ODBC Driver with SQL Connector for Apache Spark to work with your ODBC driver manager:

1. Open the .simba.sparkodbc.ini configuration file in a text editor.

2. Edit the DriverManagerEncoding setting. The value is usually UTF-16 or UTF-32, depending on the ODBC driver manager you use: iODBC uses UTF-32 and unixODBC uses UTF-16. Consult your ODBC driver manager documentation for the correct setting to use.

3. Edit the ODBCInstLib setting. The value is the name of the ODBCInst shared library for the ODBC driver manager you use. The configuration file defaults to the shared library for iODBC. In Linux, the shared library name for iODBC is libiodbcinst.so.

Note: Consult your ODBC driver manager documentation for the correct library to specify. You can specify an absolute or relative filename for the library. If you intend to use the relative filename, then the path to the library must be included in the library path environment variable. In Linux, the library path environment variable is named LD_LIBRARY_PATH.

4. Save the .simba.sparkodbc.ini configuration file.
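For instance, the relevant lines for unixODBC might look like the following sketch. The [Driver] section name is an assumption based on the shipped sample file; libodbcinst.so is the usual name of the unixODBC ODBCInst library:

```
[Driver]
# unixODBC expects UTF-16 encoding (see step 2 above)
DriverManagerEncoding=UTF-16
ODBCInstLib=libodbcinst.so
```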

Configuring Authentication

For details on selecting the appropriate authentication for a DSN using Simba ODBC Driver with SQL Connector for Apache Spark, see "Appendix A: Authentication Options" on page 22.

For details on the keys involved in configuring authentication, see “Appendix C: Driver Configuration Options” on page 27.


Using No Authentication

To use no authentication:

Set the AuthMech configuration key for the DSN to 0.

Using Kerberos

For information on operating Kerberos, refer to the documentation for your operating system.

To configure a DSN using Simba ODBC Driver with SQL Connector for Apache Spark to use Kerberos authentication:

1. Set the AuthMech configuration key for the DSN to 1.

2. If your Kerberos setup does not define a default realm or if the realm of your Spark server is not the default, then set the appropriate realm using the KrbRealm key.

3. Set the KrbHostFQDN key to the fully qualified domain name of the Shark Server 2 or Spark Thrift Server host.

4. Set the KrbServiceName key to the service name of the Shark Server 2 or Spark Thrift Server.
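Putting steps 1-4 together, a Kerberos-enabled DSN entry might look like the following sketch; the DSN name, realm, host, and service name values are placeholders:

```
[Kerberos Spark DSN]
Driver=/opt/simba/sparkodbc/lib/64/libsimbasparkodbc64.so
HOST=sparkserver.example.com
PORT=10000
AuthMech=1
KrbRealm=EXAMPLE.COM
KrbHostFQDN=sparkserver.example.com
KrbServiceName=spark
```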

Using User Name

To configure User Name authentication:

1. Set the AuthMech configuration key for the DSN to 2.

2. Set the UID key to the appropriate user name recognized by the Spark server.

Using User Name and Password

To configure User Name and Password authentication:

1. Set the AuthMech configuration key for the DSN to 3.

2. Set the UID key to the appropriate user name recognized by the Spark server.

3. Set the PWD key to the password corresponding to the user name you provided in step 2.
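As a sketch, the corresponding DSN keys for User Name and Password authentication (host and credentials are placeholders):

```
[Sample Simba Spark DSN 64]
Driver=/opt/simba/sparkodbc/lib/64/libsimbasparkodbc64.so
HOST=MySparkServer
PORT=10000
AuthMech=3
UID=spark_user
PWD=spark_password
```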

Features

SQL Query versus Spark SQL Query

The native query language supported by Spark is Spark SQL. For simple queries, Spark SQL is a subset of SQL-92. However, the syntax differs enough that most applications do not work with native Spark SQL.


SQL Connector

To bridge the difference between SQL and Spark SQL, the SQL Connector feature translates standard SQL-92 queries into equivalent Spark SQL queries. The SQL Connector performs syntactical translations and structural transformations. For example:

• Quoted Identifiers—When quoting identifiers, Spark SQL uses backquotes (`) while SQL uses double quotes ("). Even when a driver reports the backquote as the quote character, some applications still generate double-quoted identifiers.

• Table Aliases—Spark SQL does not support the AS keyword between a table reference and its alias.

• JOIN, INNER JOIN and CROSS JOIN—SQL INNER JOIN and CROSS JOIN syntax is translated to Spark SQL JOIN syntax.

• TOP N/LIMIT—SQL TOP N queries are transformed to Spark SQL LIMIT queries.
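As a toy illustration of the last transformation (not the driver's actual implementation), a TOP N query can be rewritten into the LIMIT form:

```shell
# Rewrite "SELECT TOP n ..." into "SELECT ... LIMIT n" with sed; a
# simplified stand-in for the SQL Connector's TOP N/LIMIT transformation.
echo "SELECT TOP 10 name FROM employees" \
  | sed -E 's/^SELECT TOP ([0-9]+) (.*)$/SELECT \2 LIMIT \1/'
# prints: SELECT name FROM employees LIMIT 10
```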

Data Types

The following data types are supported:

• TINYINT

• SMALLINT

• INT

• BIGINT

• FLOAT

• DOUBLE

• BOOLEAN

• STRING

• TIMESTAMP

Note: The aggregate types (ARRAY, MAP and STRUCT) are not yet supported. Columns of aggregate types are treated as STRING columns.

Catalog and Schema Support

Simba ODBC Driver with SQL Connector for Apache Spark supports both catalogs and schemas in order to make it easy for the driver to work with various ODBC applications. Since Spark only organizes tables into schema/database, we have added a synthetic catalog named SPARK under which all of the schemas/databases are organized. The driver also maps the ODBC schema to the Spark schema/database.

Spark System Table

A pseudo-table called SPARK_SYSTEM can be used to query for Spark cluster system environment information. The pseudo-table is under the pseudo-schema SPARK_SYSTEM and has two String type columns, ENVKEY and ENVVALUE. Standard SQL can be executed against the Spark system table. For example:

SELECT * FROM SPARK_SYSTEM.SPARK_SYSTEM WHERE ENVKEY LIKE '%spark%'

The example query returns all of the Spark system environment entries whose key contains the word "spark". A special query, set -v, is executed to fetch system environment information; it is not supported by all Spark versions. For versions of Spark that do not support querying system environment information, the driver returns an empty result set.

Server-side Properties

The Simba ODBC Driver with SQL Connector for Apache Spark allows you to set server-side properties via a DSN. Server-side properties specified in a DSN affect only the connection established using the DSN.

For details on setting server-side properties for a DSN, see Configuring ODBC Connections on page 8.

Get Tables With Query

Shark Server 2 has a limit on the number of tables in a database when handling the GetTables API call. When the number of tables in a database exceeds the limit, the call fails with either a stack overflow error or a timeout error. The exact limit and error depend on the JVM settings.

To address this issue, a workaround is implemented in the driver to avoid using the GetTables API call when connecting to Shark Server 2 or Spark Thrift Server. The feature can be enabled and disabled via the Get Tables With Query (GetTablesWithQuery) configuration setting.

Known Issues in Spark

The following are known issues in Apache Spark that you might encounter while using the driver.

Backquotes in aliases are not handled correctly

In Spark 1.0.x and 1.1.x, the backquotes (`) surrounding identifiers are returned as part of the column names in the result set metadata. Backquotes are used to quote identifiers in Spark SQL, and should not be considered as part of the identifier.

This issue is fixed in Spark 1.2.x.

For more information, see the following JIRA issue from Apache: https://issues.apache.org/jira/browse/SPARK-3708

Filtering on TIMESTAMP columns does not return any rows

In Spark 0.9.x, 1.0.x, and 1.1.0, using a WHERE clause to filter TIMESTAMP columns does not return any rows.


This issue is fixed in Spark 1.1.1 and later.

For more information, see the following JIRA issue from Apache: https://issues.apache.org/jira/browse/SPARK-3173

Cannot use AND to combine a TIMESTAMP column filter with another filter

In Spark 1.1.x, when you execute a query that uses the AND operator to combine a TIMESTAMP column filter with another filter, an error occurs.

As a workaround, use a subquery, as shown in the following example:

SELECT * FROM (SELECT * FROM timestamp_table WHERE (keycolumn='TimestampMicroSeconds')) s1 WHERE (column1 = '1955-10-11 11:10:33.123456');

Contact Us

If you have difficulty using the driver, please contact our Technical Support staff. We welcome your questions, comments and feature requests.

Technical Support is available Monday to Friday from 8 a.m. to 5 p.m. Pacific Time.

Important: To help us assist you, prior to contacting Technical Support please prepare a detailed summary of the client and server environment including operating system version, patch level and configuration.

You can contact Technical Support via:

• E-mail: [email protected]

• Web site: www.simba.com

• Telephone: (604) 633-0008 Extension 3

• Fax: (604) 633-0004

You can also follow us on Twitter @SimbaTech


Appendix A: Authentication Options

Shark Server supports the following authentication mechanisms:

• No Authentication

Shark Server 2 and Spark Thrift Server support the following authentication mechanisms:

• No Authentication

• Kerberos

• User Name

• User Name and Password

Note: Kerberos, User Name, and User Name and Password authentication mechanisms are available only for Shark Server 2 and Spark Thrift Server distributions.

To determine the authentication mechanism configured for your Shark Server 2 or Spark Thrift Server, examine the following properties in your hive-site.xml file:

• hive.server2.authentication

• hive.server2.enable.doAs

hive.server2.authentication | hive.server2.enable.doAs | Driver Authentication Mechanism
NOSASL                      | False                    | No Authentication
KERBEROS                    | True or False            | Kerberos
NONE                        | True or False            | User Name

Note: It is an error to set hive.server2.authentication to NOSASL and hive.server2.enable.doAs to true. This configuration will not prevent the service from starting up but results in an unusable service.

For more detail on authentication mechanisms, see the documentation for your Hadoop / Spark distribution. See also “Running Hadoop in Secure Mode” at http://hadoop.apache.org/docs/r0.23.7/hadoop-project-dist/hadoop-common/ClusterSetup.html#Running_Hadoop_in_Secure_Mode

Using No Authentication

When hive.server2.authentication is set to NOSASL, you must configure your connection to use No Authentication.


Using Kerberos

When connecting to a Spark server of type Shark Server 2 or Spark Thrift Server and hive.server2.authentication is set to KERBEROS, you must configure your connection to use Kerberos.

Using User Name

When connecting to a Spark server of type Shark Server 2 or Spark Thrift Server and hive.server2.authentication is set to NONE, you must configure your connection to use User Name. Validation of the credentials that you include depends on hive.server2.enable.doAs:

• If hive.server2.enable.doAs is set to true, then the User Name in the DSN or driver configuration must be an existing OS user on the host running Shark Server 2 or Spark Thrift Server.

• If hive.server2.enable.doAs is set to false, then the User Name in the DSN or driver configuration is ignored.

If the User Name in the DSN or driver configuration is not supplied, then the driver defaults to using “anonymous” as the user name.

Using User Name and Password

When connecting to a Spark server of type Shark Server 2 or Spark Thrift Server that is configured to use SASL-PLAIN authentication with user name and password, you must configure your connection to use User Name and Password.


Appendix B: Configuring Kerberos Authentication for Windows

MIT Kerberos

Download and install MIT Kerberos for Windows 4.0.1

1. For 64-bit computers: http://web.mit.edu/kerberos/dist/kfw/4.0/kfw-4.0.1-amd64.msi. The installer includes both 32-bit and 64-bit libraries.

2. For 32-bit computers: http://web.mit.edu/kerberos/dist/kfw/4.0/kfw-4.0.1-i386.msi. The installer includes 32-bit libraries only.

Set up the Kerberos configuration file in the default location

1. Obtain a krb5.conf configuration file from your Kerberos administrator. The configuration file should also be present at /etc/krb5.conf on the machine hosting the Shark Server 2 or Spark Thrift Server.

2. The default location is C:\ProgramData\MIT\Kerberos5, but this is normally a hidden directory. Consult your Windows documentation if you wish to view and use this hidden directory.

3. Rename the configuration file from krb5.conf to krb5.ini.

4. Copy krb5.ini to the default location and overwrite the empty sample file.

Consult the MIT Kerberos documentation for more information on configuration.

Set up the Kerberos configuration file in another location

If you do not want to put the Kerberos configuration file in the default location, you can store it in another location as follows:

1. Obtain a krb5.conf configuration file for your Kerberos setup.

2. Store krb5.conf in an accessible directory and make note of the full path name.

3. Click the Windows Start menu.

4. Right-click Computer.

5. Click Properties.

6. Click Advanced system settings.

7. Click Environment Variables.

8. Click New for System variables.

9. In the Variable Name field, type KRB5_CONFIG.

10. In the Variable Value field, type the absolute path to the krb5.conf file you stored in step 2.


11. Click OK to save the new variable.

12. Ensure the variable is listed in the System variables list.

13. Click OK to close the Environment Variables window.

14. Click OK to close the System Properties window.

Set up the Kerberos credential cache file

1. Create a new directory where you want to save the Kerberos credential cache file. For example, you may create the directory C:\temp

2. Click the Windows Start menu.

3. Right-click Computer.

4. Click Properties.

5. Click Advanced system settings.

6. Click Environment Variables.

7. Click New for System variables.

8. In the Variable Name field, type KRB5CCNAME

9. In the Variable Value field, type the path to the folder you created in step 1, and then append the file name krb5cache. For example, if you created the folder C:\temp in step 1, then type C:\temp\krb5cache

Note: krb5cache is a file—not a directory—managed by the Kerberos software and should not be created by the user. If you receive a permission error when you first use Kerberos, check to ensure that the krb5cache file does not exist as a file or a directory.

10. Click OK to save the new variable.

11. Ensure the variable appears in the System variables list.

12. Click OK to close the Environment Variables window.

13. Click OK to close the System Properties window.

14. Restart your computer to ensure that MIT Kerberos for Windows uses the new settings.

Obtain a ticket for a Kerberos principal using a password

Note: If your Kerberos environment uses keytab files please see the next section.

1. Click the Start button.

2. Click All Programs.

3. Click the Kerberos for Windows (64-bit) or the Kerberos for Windows (32-bit) program group.

4. Use MIT Kerberos Ticket Manager to obtain a ticket for the principal that will be connecting to Shark Server 2 or Spark Thrift Server.


Obtain a ticket for a Kerberos principal using a keytab file

1. Click the Start button.

2. Click All Programs.

3. Click Accessories.

4. Click Command Prompt.

5. Type: kinit -k -t <keytab pathname> <principal>

<keytab pathname> is the full pathname to the keytab file. For example, C:\mykeytabs\sparkserver2.keytab

<principal> is the Kerberos principal to use for authentication. For example, spark/[email protected]

Obtain a ticket for a Kerberos principal using the default keytab file

A default keytab file can be set for your Kerberos configuration. Consult the MIT Kerberos documentation for instructions on configuring a default keytab file.

1. Click the Start button.

2. Click All Programs.

3. Click Accessories.

4. Click Command Prompt.

5. Type: kinit -k <principal>

<principal> is the Kerberos principal to use for authentication. For example, spark/[email protected]


Appendix C: Driver Configuration Options

The configuration options available to control the behavior of Simba ODBC Driver with SQL Connector for Apache Spark are listed and described below.

Note: You can pass these configuration options in your connection string or set them in your odbc.ini and .simba.sparkodbc.ini files. Configuration options set in a .simba.sparkodbc.ini file apply to all connections, whereas configuration options passed in the connection string or set in an odbc.ini file are specific to a connection. Configuration options passed in the connection string take precedence over those set in odbc.ini, and configuration options set in odbc.ini take precedence over those set in .simba.sparkodbc.ini.

Driver (required; no default value)
The name of the installed driver (Simba Spark ODBC Driver) or the absolute path of the Simba ODBC Driver with SQL Connector for Apache Spark shared object file.

HOST (required; no default value)
The IP address or host name of the Spark server.

PORT (required; default: 10000)
The listening port for the service.

Schema (optional; default: default)
The name of the database schema to use when a schema is not explicitly specified in a query. Note: Queries on other schemas can still be issued by explicitly specifying the schema in the query. To determine the appropriate database schema to use, type the show databases command at the Spark command prompt to inspect your databases.

DefaultStringColumnLength (optional; default: 255)
The maximum data length for String columns. Note: Spark does not provide the maximum data length for String columns in the columns metadata. This option allows you to tune the maximum data length for String columns.


BinaryColumnLength (optional; default: 32767)
The maximum data length for Binary columns. Note: Spark does not provide the maximum data length for Binary columns in the columns metadata. This option allows you to tune the maximum data length for Binary columns.

UseNativeQuery (optional; default: 0)
Setting UseNativeQuery to 1 disables the SQL Connector feature. The SQL Connector feature applies transformations to the queries emitted by an application to convert them into an equivalent form in Spark SQL. If the application is Spark-aware and already emits Spark SQL, then turning off the SQL Connector feature avoids the extra overhead of query transformation.

FastSQLPrepare (optional; default: 0)
To enable the FastSQLPrepare option, use a value of 1. Enabling FastSQLPrepare defers query execution to SQLExecute. When using Native Query mode, the driver executes the Spark SQL query to retrieve the result set metadata for SQLPrepare, so SQLPrepare might be slow. If the result set metadata is not required after calling SQLPrepare, then enable FastSQLPrepare.

RowsFetchedPerBlock (optional; default: 10000)
The maximum number of rows that a query returns at a time. Any positive 32-bit integer is a valid value, but testing has shown that performance gains are marginal beyond the default value of 10000 rows.


DecimalColumnScale
Default: 10
The maximum number of digits to the right of the decimal point for numeric data types. (Optional)

SSP_
Default: (none)
To set a server-side property, use the syntax SSP_SSPKey=SSPValue, where SSPKey is the name of the server-side property to set and SSPValue is the value to assign to it. For example: SSP_some.key.name=myValue. After the driver applies the server-side property, the SSP_ prefix is removed from the DSN entry, leaving an entry of SSPKey=SSPValue. (Optional)

ApplySSPWithQueries
Default: 1
When set to the default value of 1 (enabled), each server-side property you set is applied by executing a set SSPKey=SSPValue query when opening a session to the Spark server. Applying server-side properties using queries requires an additional network round trip per property when establishing a session. However, some Shark Server 2 or Spark Thrift Server builds are not compatible with the more efficient method that the driver uses when ApplySSPWithQueries is disabled (set to 0). Note: When connecting to a Shark Server, ApplySSPWithQueries is always enabled. (Optional)


LCaseSspKeyName
Default: 1
Controls whether the driver converts server-side property key names to all lowercase characters. Set to 1 to enable or 0 to disable. (Optional)

SparkServerType
Default: 1
The Spark server type. Set to 1 for Shark Server, 2 for Shark Server 2, or 3 for Spark Thrift Server. (Optional)

AuthMech
Default: 0
The authentication mechanism to use. Set to 0 for No Authentication, 1 for Kerberos, 2 for User Name, or 3 for User Name and Password. (Optional)

KrbHostFQDN
Default: (none)
The fully qualified domain name of the Shark Server 2 or Spark Thrift Server host. (Required if AuthMech is Kerberos)

KrbServiceName
Default: (none)
The Kerberos service principal name of the Shark Server 2 or Spark Thrift Server. (Required if AuthMech is Kerberos)

KrbRealm
Default: depends on your Kerberos configuration
If there is no default realm configured, or the realm of the Shark Server 2 or Spark Thrift Server host differs from the default realm of your Kerberos setup, use this option to define the realm of that host. (Optional)


UID
Default: (none)
When using User Name authentication, the user name of an existing user on the host running Shark Server 2 or Spark Thrift Server. Important: In this case, you must set the hive.server2.authentication property in the hive-site.xml file for the Shark Server 2 or Spark Thrift Server to NONE. When using User Name and Password authentication, the user name set up for Shark Server 2 or Spark Thrift Server. (Required if AuthMech is User Name and Password)

PWD
Default: (none)
The password set up for the User Name and Password authentication mechanism. (Required if AuthMech is User Name and Password)

GetTablesWithQuery
Default: 0
Controls whether the driver retrieves the names of tables in a database using a SHOW TABLES query instead of the GetTables Thrift API call. Set to 1 to enable or 0 to disable. Note: This setting is applicable only when connecting to Shark Server 2 or Spark Thrift Server. (Optional)

Table 1 Driver Configuration Options
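As an illustration of how the keys in Table 1 combine, the following sketch assembles a DSN-less connection string from a dictionary of options. The host, credentials, and server-side property are hypothetical, and the pyodbc usage shown in the comment is an assumption; connecting requires the installed driver and a live Spark server, so only the string construction is demonstrated.

```python
# Sketch: assemble a DSN-less ODBC connection string from the driver
# configuration options in Table 1. Host, user, password, and the
# server-side property below are hypothetical placeholders.

def build_connection_string(options):
    """Join key=value pairs with semicolons, as ODBC connection strings expect."""
    return ";".join(f"{key}={value}" for key, value in options.items())

options = {
    "Driver": "Simba Spark ODBC Driver",   # or the shared object path on Linux
    "HOST": "spark.example.com",           # hypothetical host
    "PORT": 10000,
    "SparkServerType": 3,                  # 3 = Spark Thrift Server
    "AuthMech": 3,                         # 3 = User Name and Password
    "UID": "sparkuser",                    # hypothetical user
    "PWD": "secret",                       # hypothetical password
    "SSP_some.key.name": "myValue",        # server-side property via SSP_ prefix
}

conn_str = build_connection_string(options)
print(conn_str)

# With pyodbc installed and a reachable server, a connection would then
# typically be opened with:
#   import pyodbc
#   conn = pyodbc.connect(conn_str, autocommit=True)
```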
