
Simba ODBC Driver with SQL Connector for Apache Spark

Installation and Configuration Guide

Simba Technologies Inc.

April 2, 2015

Copyright © 2015 Simba Technologies Inc. All Rights Reserved.

Information in this document is subject to change without notice. Companies, names and data used in examples herein are fictitious unless otherwise noted. No part of this publication, or the software it describes, may be reproduced, transmitted, transcribed, stored in a retrieval system, decompiled, disassembled, reverse-engineered, or translated into any language in any form by any means for any purpose without the express written permission of Simba Technologies Inc.

Trademarks

Simba, the Simba logo, SimbaEngine, SimbaEngine C/S, SimbaExpress and SimbaLib are registered trademarks of Simba Technologies Inc. All other trademarks and/or servicemarks are the property of their respective owners.

Contact Us

Simba Technologies Inc.
938 West 8th Avenue
Vancouver, BC Canada
V5Z 1E5

Tel: +1 (604) 633-0008

Fax: +1 (604) 633-0004

www.simba.com

Cyrus SASL

Copyright (c) 1998-2003 Carnegie Mellon University. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. The name "Carnegie Mellon University" must not be used to endorse or promote products derived from this software without prior written permission. For permission or any other legal details, please contact:

Office of Technology Transfer
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3890
(412) 268-4387, fax: (412) 268-7395
[email protected]

4. Redistributions of any form whatsoever must retain the following acknowledgment:

"This product includes software developed by Computing Services at Carnegie Mellon University (http://www.cmu.edu/computing/)."

CARNEGIE MELLON UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

ICU License - ICU 1.8.1 and later

COPYRIGHT AND PERMISSION NOTICE

Copyright (c) 1995-2010 International Business Machines Corporation and others. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.

All trademarks and registered trademarks mentioned herein are the property of their respective owners.

OpenSSL

Copyright (c) 1998-2008 The OpenSSL Project. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. All advertising materials mentioning features or use of this software must display the following acknowledgment:
"This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. (http://www.openssl.org/)"

4. The names "OpenSSL Toolkit" and "OpenSSL Project" must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact [email protected].

5. Products derived from this software may not be called "OpenSSL" nor may "OpenSSL" appear in their names without prior written permission of the OpenSSL Project.

6. Redistributions of any form whatsoever must retain the following acknowledgment:
"This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (http://www.openssl.org/)"

THIS SOFTWARE IS PROVIDED BY THE OpenSSL PROJECT "AS IS" AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE OpenSSL PROJECT OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Apache Spark

Copyright 2008-2011 The Apache Software Foundation.

Apache Thrift

Copyright 2006-2010 The Apache Software Foundation.

Expat

Copyright (c) 1998, 1999, 2000 Thai Open Source Software Center Ltd

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

libcurl

COPYRIGHT AND PERMISSION NOTICE

Copyright (c) 1996 - 2012, Daniel Stenberg, <[email protected]>.

All rights reserved.

Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.


About This Guide

Purpose

The Simba ODBC Driver with SQL Connector for Apache Spark Installation and Configuration Guide explains how to install and configure the Simba ODBC Driver with SQL Connector for Apache Spark on all supported platforms. The guide also provides details related to features of the driver.

Audience

The guide is intended for end users of the Simba ODBC Driver with SQL Connector for Apache Spark, as well as administrators and developers implementing the driver.

Knowledge Prerequisites

To use the Simba ODBC Driver with SQL Connector for Apache Spark, the following knowledge is helpful:

• Familiarity with the platform on which you are using the Simba ODBC Driver with SQL Connector for Apache Spark

• Ability to use the data source to which the Simba ODBC Driver with SQL Connector for Apache Spark is connecting

• An understanding of the role of ODBC technologies and driver managers in connecting to a data source

• Experience creating and configuring ODBC connections

• Exposure to SQL

Document Conventions

Italics are used when referring to book and document titles.

Bold is used in procedures for graphical user interface elements that a user clicks and text that a user types.

Monospace font indicates commands, source code or contents of text files.

Underline is not used.

The pencil icon indicates a short note appended to a paragraph.

The star icon indicates an important comment related to the preceding paragraph.

The thumbs up icon indicates a practical tip or suggestion.


Table of Contents

Introduction 9

Windows Driver 10
    System Requirements 10
    Installing the Driver 10
    Verifying the Version Number 11
    Creating a Data Source Name 11
    Configuring Authentication 12
    Configuring Advanced Options 17
    Configuring Server-Side Properties 18

Linux Driver 20
    System Requirements 20
    Installing the Driver Using the RPM 20
    Installing the Driver Using the Tarball Package 22
    Verifying the Version Number 22
    Setting the LD_LIBRARY_PATH Environment Variable 22

Mac OS X Driver 24
    System Requirements 24
    Installing the Driver 24
    Verifying the Version Number 25
    Setting the DYLD_LIBRARY_PATH Environment Variable 25

Configuring ODBC Connections for Non-Windows Platforms 26
    Files 26
    Sample Files 27
    Configuring the Environment 27
    Configuring the odbc.ini File 28
    Configuring the odbcinst.ini File 29
    Configuring the simba.sparkodbc.ini File 30
    Configuring Authentication 31

Features 35
    SQL Query versus Spark SQL Query 35
    SQL Connector 35
    Data Types 35
    Catalog and Schema Support 36
    spark_system Table 37
    Server-Side Properties 37
    Get Tables With Query 37

Known Issues in Spark 38
    Backquotes in Aliases Are Not Handled Correctly 38
    Filtering on TIMESTAMP Columns Does Not Return Any Rows 38
    Cannot Use AND to Combine a TIMESTAMP Column Filter with Another Filter 38

Contact Us 39

Appendix A Authentication Options 40
    Using No Authentication 41
    Using Kerberos 41
    Using User Name 41
    Using User Name and Password 42
    Using User Name and Password (SSL) 42
    Using Windows Azure HDInsight Emulator 42
    Using Windows Azure HDInsight Service 42
    Using HTTP 42
    Using HTTPS 42

Appendix B Configuring Kerberos Authentication for Windows 43
    MIT Kerberos 43

Appendix C Driver Configuration Options 47
    Configuration Options Appearing in the User Interface 47
    Configuration Options Having Only Key Names 58


Introduction

The Simba ODBC Driver with SQL Connector for Apache Spark is used for direct SQL and Spark SQL access to Apache Hadoop / Spark distributions, enabling Business Intelligence (BI), analytics, and reporting on Hadoop-based data. The driver efficiently transforms an application's SQL query into the equivalent form in Spark SQL, which is a subset of SQL-92. If an application is Spark-aware, then the driver is configurable to pass the query through to the database for processing. The driver interrogates Spark to obtain schema information to present to a SQL-based application. Queries, including joins, are translated from SQL to Spark SQL. For more information about the differences between Spark SQL and SQL, see Features on page 35.

The Simba ODBC Driver with SQL Connector for Apache Spark complies with the ODBC 3.52 data standard and adds important functionality such as Unicode and 32- and 64-bit support for high-performance computing environments.

ODBC is one of the most established and widely supported APIs for connecting to and working with databases. At the heart of the technology is the ODBC driver, which connects an application to the database. For more information about ODBC, see http://www.simba.com/resources/data-access-standards-library. For complete information about the ODBC specification, see the ODBC API Reference at http://msdn.microsoft.com/en-us/library/windows/desktop/ms714562(v=vs.85).aspx.

The Installation and Configuration Guide is suitable for users who are looking to access data residing within Hadoop from their desktop environment. Application developers may also find the information helpful. Refer to your application for details on connecting via ODBC.


Windows Driver

System Requirements

You install the Simba ODBC Driver with SQL Connector for Apache Spark on client computers accessing data in a Hadoop cluster with the Spark service installed and running. Each computer where you install the driver must meet the following minimum system requirements:

• One of the following operating systems (32- and 64-bit editions are supported):
    o Windows® XP with SP3
    o Windows® Vista
    o Windows® 7 Professional and Enterprise
    o Windows® 8 Pro and Enterprise
    o Windows® Server 2008 R2

• 25 MB of available disk space

To install the driver, you must have Administrator privileges on the computer.

The driver is suitable for use with all versions of Apache Spark.

Installing the Driver

On 64-bit Windows operating systems, you can execute 32- and 64-bit applications transparently. You must use the version of the driver matching the bitness of the client application accessing data in Hadoop / Spark:

• SimbaSparkODBC32.msi for 32-bit applications

• SimbaSparkODBC64.msi for 64-bit applications

You can install both versions of the driver on the same computer.

For an explanation of how to use ODBC on 64-bit editions of Windows, see http://www.simba.com/wp-content/uploads/2010/10/HOW-TO-32-bit-vs-64-bit-ODBC-Data-Source-Administrator.pdf.

To install the Simba ODBC Driver with SQL Connector for Apache Spark:
1. Depending on the bitness of your client application, double-click to run SimbaSparkODBC32.msi or SimbaSparkODBC64.msi
2. Click Next
3. Select the check box to accept the terms of the License Agreement if you agree, and then click Next
4. To change the installation location, click Change, then browse to the desired folder, and then click OK. To accept the installation location, click Next
5. Click Install
6. When the installation completes, click Finish
7. If you received a license file via e-mail, then copy the license file into the \lib subfolder in the installation folder you selected in step 4.

To avoid security issues, you may need to save the license file on your local computer prior to copying the file into the \lib subfolder.

Verifying the Version Number

If you need to verify the version of the Simba ODBC Driver with SQL Connector for Apache Spark that is installed on your Windows machine, you can find the version number in the ODBC Data Source Administrator.

To verify the version number:
1. Click the Start button, then click All Programs, then click the Simba Spark ODBC Driver 1.0 program group corresponding to the bitness of the client application accessing data in Hadoop / Spark, and then click ODBC Administrator
2. In the ODBC Data Source Administrator, click the Drivers tab and then find the Simba Spark ODBC Driver in the list of ODBC drivers that are installed on your system. The version number is displayed in the Version column.

Creating a Data Source Name

After installing the Simba ODBC Driver with SQL Connector for Apache Spark, you need to create a Data Source Name (DSN).

To create a Data Source Name:
1. Click the Start button, then click All Programs, then click the Simba Spark ODBC Driver 1.0 program group corresponding to the bitness of the client application accessing data in Hadoop / Spark, and then click ODBC Administrator
2. In the ODBC Data Source Administrator, click the Drivers tab, and then scroll down as needed to confirm that the Simba Spark ODBC Driver appears in the alphabetical list of ODBC drivers that are installed on your system.
3. To create a DSN that only the user currently logged into Windows can use, click the User DSN tab.
OR
To create a DSN that all users who log into Windows can use, click the System DSN tab.
4. Click Add
5. In the Create New Data Source dialog box, select Simba Spark ODBC Driver and then click Finish
6. Use the options in the Simba Spark ODBC Driver DSN Setup dialog box to configure your DSN:


a) In the Data Source Name field, type a name for your DSN.
b) Optionally, in the Description field, type relevant details about the DSN.
c) In the Host field, type the IP address or host name of the Spark server.
d) In the Port field, type the number of the TCP port on which the Spark server is listening.
e) In the Database field, type the name of the database schema to use when a schema is not explicitly specified in a query.

You can still issue queries on other schemas by explicitly specifying the schema in the query. To inspect your databases and determine the appropriate schema to use, type the show databases command at the Spark command prompt.

f) In the Spark Server Type list, select the appropriate server type for the version of Spark that you are running:
    • If you are running Shark 0.8.1 or earlier, then select SharkServer
    • If you are running Shark 0.9.*, then select SharkServer2
    • If you are running Spark 1.1 or later, then select SparkThriftServer
g) In the Authentication area, configure authentication as needed. For more information, see Configuring Authentication on page 12.

Shark Server does not support authentication. Most default configurations of Shark Server 2 or Spark Thrift Server require User Name authentication. To verify the authentication mechanism that you need to use for your connection, check the configuration of your Hadoop / Spark distribution. For more information, see Authentication Options on page 40.

h) To configure advanced driver options, click Advanced Options. For more information, see Configuring Advanced Options on page 17.
i) To configure server-side properties, click Advanced Options and then click Server Side Properties. For more information, see Configuring Server-Side Properties on page 18.

7. To test the connection, click Test. Review the results as needed, and then click OK.

If the connection fails, then confirm that the settings in the Simba Spark ODBC Driver DSN Setup dialog box are correct. Contact your Spark server administrator as needed.

8. To save your settings and close the Simba Spark ODBC Driver DSN Setup dialog box, click OK
9. To close the ODBC Data Source Administrator, click OK
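Once the DSN is saved, any ODBC-capable application can connect through it. The following Python sketch, which is an illustration rather than part of the product, builds a DSN-based connection string; the DSN name "Spark DSN" and the user name "sparkuser" are hypothetical placeholders for the values you configured above.

```python
# Sketch: using a saved DSN from an application. The DSN name and
# credentials below are hypothetical -- substitute your own values.

def build_connection_string(dsn, uid=None, pwd=None):
    """Assemble a DSN-based ODBC connection string, adding credentials
    only for authentication mechanisms that require them."""
    parts = ["DSN=" + dsn]
    if uid is not None:
        parts.append("UID=" + uid)
    if pwd is not None:
        parts.append("PWD=" + pwd)
    return ";".join(parts)

conn_str = build_connection_string("Spark DSN", uid="sparkuser")
print(conn_str)  # DSN=Spark DSN;UID=sparkuser

# With the driver installed and a reachable Spark server, the string can
# be passed to a third-party ODBC module such as pyodbc:
#
#   import pyodbc
#   with pyodbc.connect(conn_str, autocommit=True) as conn:
#       for row in conn.cursor().execute("SHOW TABLES"):
#           print(row)
```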

Configuring Authentication

For information about selecting the appropriate authentication mechanism to use, see Appendix A Authentication Options on page 40.


Using No Authentication

When connecting to a Spark server of type Shark Server, you must use No Authentication.

To configure a connection without authentication:
1. To access authentication options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, and then click Configure
2. In the Mechanism list, select No Authentication
3. To save your settings and close the dialog box, click OK

Using Kerberos

Kerberos must be installed and configured before you can use this authentication mechanism. For more information, see Appendix B Configuring Kerberos Authentication for Windows on page 43.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.

To configure Kerberos authentication:
1. To access authentication options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, and then click Configure
2. In the Mechanism list, select Kerberos
3. If your Kerberos setup does not define a default realm or if the realm of your Shark Server 2 or Spark Thrift Server host is not the default, then type the Kerberos realm of the Shark Server 2 or Spark Thrift Server host in the Realm field.
OR
To use the default realm defined in your Kerberos setup, leave the Realm field empty.
4. In the Host FQDN field, type the fully qualified domain name of the Shark Server 2 or Spark Thrift Server host.
5. In the Service Name field, type the service name of the Spark server.
6. To save your settings and close the dialog box, click OK
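The same Kerberos settings can also be supplied programmatically as a DSN-less connection string. The sketch below is a hypothetical illustration: the key names (AuthMech, KrbRealm, KrbHostFQDN, KrbServiceName) and the AuthMech value are assumptions modeled on common Simba driver configuration keys, so confirm them against Appendix C Driver Configuration Options before relying on them.

```python
# Sketch: Kerberos DSN settings as a DSN-less ODBC connection string.
# Key names and the AuthMech value are assumptions -- verify against
# Appendix C, "Driver Configuration Options".

def kerberos_connection_string(host, port, realm, host_fqdn, service_name):
    pairs = [
        ("Driver", "Simba Spark ODBC Driver"),
        ("Host", host),
        ("Port", str(port)),
        ("AuthMech", "1"),            # assumed to select Kerberos
        ("KrbRealm", realm),          # leave empty to use the default realm
        ("KrbHostFQDN", host_fqdn),
        ("KrbServiceName", service_name),
    ]
    return ";".join("{}={}".format(k, v) for k, v in pairs)

print(kerberos_connection_string(
    "spark.example.com", 10000, "EXAMPLE.COM", "spark.example.com", "spark"))
```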

Using User Name

This authentication mechanism requires a user name but not a password. The user name labels the session, facilitating database tracking.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions. Most default configurations of Shark Server 2 or Spark Thrift Server require User Name authentication.


To configure User Name authentication:
1. To access authentication options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, and then click Configure
2. In the Mechanism list, select User Name
3. In the User Name field, type an appropriate user name for accessing the Spark server.
4. To save your settings and close the dialog box, click OK

Using User Name and Password

This authentication mechanism requires a user name and a password.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.

To configure User Name and Password authentication:
1. To access authentication options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, and then click Configure
2. In the Mechanism list, select User Name and Password
3. In the User Name field, type an appropriate user name for accessing the Spark server.
4. In the Password field, type the password corresponding to the user name you typed in step 3.
5. To save your settings and close the dialog box, click OK

Using User Name and Password (SSL)

This authentication mechanism uses SSL and requires a user name and a password. The driver accepts self-signed SSL certificates for this authentication mechanism.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.

To configure User Name and Password (SSL) authentication:
1. To access authentication options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, and then click Configure
2. In the Mechanism list, select User Name and Password (SSL)
3. In the User Name field, type an appropriate user name for accessing the Spark server.
4. In the Password field, type the password corresponding to the user name you typed in step 3.
5. Optionally, configure the driver to allow the common name of a CA-issued certificate to not match the host name of the Spark server by clicking Advanced Options, and then selecting the Allow Common Name Host Name Mismatch check box.

For self-signed certificates, the driver always allows the common name of the certificate to not match the host name.

6. To configure the driver to load SSL certificates from a specific PEM file, click Advanced Options, and then type the path to the file in the Trusted Certificates field.
OR
To use the trusted CA certificates PEM file that is installed with the driver, leave the Trusted Certificates field empty.
7. To save your settings and close the dialog box, click OK

Using Windows Azure HDInsight Emulator

This authentication mechanism enables you to connect to a Spark server on Windows Azure HDInsight Emulator.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on an HDInsight distribution.

To configure a connection to a Spark server on Windows Azure HDInsight Emulator:
1. To access authentication options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, and then click Configure
2. In the Mechanism list, select Windows Azure HDInsight Emulator
3. In the HTTP Path field, type the partial URL corresponding to the Spark server.
4. In the User Name field, type an appropriate user name for accessing the Spark server.
5. In the Password field, type the password corresponding to the user name you typed in step 4.
6. To save your settings and close the dialog box, click OK

Using Windows Azure HDInsight Service

This authentication mechanism enables you to connect to a Spark server on Windows Azure HDInsight Service. The driver does not accept self-signed SSL certificates for this authentication mechanism. Also, the common name of the CA-issued certificate must match the host name of the Spark server.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on HDInsight distributions.

To configure a connection to a Spark server on Windows Azure HDInsight Service:
1. To access authentication options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, and then click Configure
2. In the Mechanism list, select Windows Azure HDInsight Service
3. In the HTTP Path field, type the partial URL corresponding to the Spark server.
4. In the User Name field, type an appropriate user name for accessing the Spark server.
5. In the Password field, type the password corresponding to the user name you typed in step 4.
6. To configure the driver to load SSL certificates from a specific PEM file, click Advanced Options and type the path to the file in the Trusted Certificates field.
OR
To use the trusted CA certificates PEM file that is installed with the driver, leave the Trusted Certificates field empty.
7. To save your settings and close the dialog box, click OK

Using HTTP

This authentication mechanism enables you to connect to a Shark Server 2 or Spark Thrift Server running in HTTP mode.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.

To configure HTTP authentication:
1. To access authentication options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, and then click Configure
2. In the Mechanism list, select HTTP
3. In the HTTP Path field, type the partial URL corresponding to the Spark server.
4. In the User Name field, type an appropriate user name for accessing the Spark server.
5. In the Password field, type the password corresponding to the user name you typed in step 4.
6. To save your settings and close the dialog box, click OK

Using HTTPS

This authentication mechanism enables you to connect to a Shark Server 2 or Spark Thrift Server that is running in HTTP mode and has SSL enabled. The driver accepts self-signed SSL certificates for this authentication mechanism.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.

To configure HTTPS authentication:
1. To access authentication options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, and then click Configure
2. In the Mechanism list, select HTTPS
3. In the HTTP Path field, type the partial URL corresponding to the Spark server.
4. In the User Name field, type an appropriate user name for accessing the Spark server.
5. In the Password field, type the password corresponding to the user name you typed in step 4.
6. Optionally, configure the driver to allow the common name of a CA-issued certificate to not match the host name of the Spark server by clicking Advanced Options and selecting the Allow Common Name Host Name Mismatch check box.

For self-signed certificates, the driver always allows the common name of the certificate to not match the host name.

7. To configure the driver to load SSL certificates from a specific PEM file, click Advanced Options, and then type the path to the file in the Trusted Certificates field.
OR
To use the trusted CA certificates PEM file that is installed with the driver, leave the Trusted Certificates field empty.
8. To save your settings and close the dialog box, click OK

The HTTPS authentication method can also be used to connect to Shark Server 2 or Spark Thrift Server via the Knox gateway. To determine what user credentials to use and what value to set for the HTTP Path field, refer to the Knox documentation.
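For HTTP-mode connections, the HTTP Path value can likewise appear in a DSN-less connection string. In the sketch below, the HTTPPath key name and the example gateway path are assumptions (a Knox deployment is illustrated with a made-up path); check Appendix C Driver Configuration Options and your Knox documentation for the actual key names and values.

```python
# Sketch: a DSN-less connection string for a Shark Server 2 / Spark
# Thrift Server running in HTTP mode. The HTTPPath key name is an
# assumption -- verify it against Appendix C. UID/PWD correspond to the
# User Name and Password fields in the dialog box; the mechanism key
# itself is omitted here because its exact key/value pair is documented
# in Appendix C.

def http_mode_connection_string(host, port, http_path, uid, pwd):
    pairs = [
        ("Driver", "Simba Spark ODBC Driver"),
        ("Host", host),
        ("Port", str(port)),
        ("HTTPPath", http_path),  # partial URL of the Spark server
        ("UID", uid),
        ("PWD", pwd),
    ]
    return ";".join("{}={}".format(k, v) for k, v in pairs)

# Hypothetical Knox-style path; consult the Knox documentation for yours.
print(http_mode_connection_string(
    "gateway.example.com", 8443, "gateway/default/spark", "user", "secret"))
```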

Configuring Advanced Options

You can configure advanced options to modify the behavior of the driver.

To configure advanced options:
1. To access advanced options, open the ODBC Data Source Administrator where you created the DSN, then select the DSN, then click Configure, and then click Advanced Options
2. To disable the SQL Connector feature, select the Use Native Query check box.
3. To defer query execution to SQLExecute, select the Fast SQLPrepare check box.
4. To allow driver-wide configurations to take precedence over connection and DSN settings, select the Driver Config Take Precedence check box.
5. To use the asynchronous version of the API call against Spark for executing a query, select the Use Async Exec check box.
6. To retrieve the names of tables in a database by using the SHOW TABLES query, select the Get Tables With Query check box.

This option is applicable only when connecting to Shark Server 2 or Spark Thrift Server.

www.simba.com 17

Simba ODBC Driver with SQL Connector for Apache Spark Installation and Configuration Guide

7. To enable the driver to return SQL_WVARCHAR instead of SQL_VARCHAR for STRING and VARCHAR columns, and SQL_WCHAR instead of SQL_CHAR for CHAR columns, select the Unicode SQL character types check box.

8. To enable the driver to return the spark_system table for catalog function calls such as SQLTables and SQLColumns, select the Show system table check box.

9. In the Rows fetched per block field, type the number of rows to be fetched per block.
10. In the Default string column length field, type the maximum data length for STRING columns.
11. In the Binary column length field, type the maximum data length for BINARY columns.
12. In the Decimal column scale field, type the maximum number of digits to the right of the decimal point for numeric data types.
13. In the Async Exec Poll Interval (ms) field, type the time in milliseconds between each poll for the query execution status.

This option is applicable only to HDInsight clusters.

14. To allow the common name of a CA-issued SSL certificate to not match the host name of the Spark server, select the Allow Common Name Host Name Mismatch check box.

This option is applicable only to the User Name and Password (SSL) and HTTPS authentication mechanisms.

15. To configure the driver to load SSL certificates from a specific PEM file, type the path to the file in the Trusted Certificates field.
    OR
    To use the trusted CA certificates PEM file that is installed with the driver, leave the Trusted Certificates field empty.

This option is applicable only to the User Name and Password (SSL), Windows Azure HDInsight Service, and HTTPS authentication mechanisms.

16. To save your settings and close the Advanced Options dialog box, click OK.

Configuring Server-Side Properties

You can use the driver to apply configuration properties to the Spark server.

To configure server-side properties:
1. To configure server-side properties, open the ODBC Data Source Administrator where you created the DSN, then select the DSN and click Configure, then click Advanced Options, and then click Server Side Properties.
2. To create a server-side property, click Add, then type appropriate values in the Key and Value fields, and then click OK.


For a list of all Hadoop and Spark server-side properties that your implementation supports, type set -v at the Spark CLI command line. You can also execute the set -v query after connecting using the driver.

3. To edit a server-side property, select the property from the list, then click Edit, then update the Key and Value fields as needed, and then click OK.

4. To delete a server-side property, select the property from the list, and then click Remove. In the confirmation dialog box, click Yes.

5. To configure the driver to apply each server-side property by executing a query when opening a session to the Spark server, select the Apply Server Side Properties with Queries check box.

OR
To configure the driver to use a more efficient method for applying server-side properties that does not involve additional network round-tripping, clear the Apply Server Side Properties with Queries check box.

The more efficient method is not available for Shark Server, and it might not be compatible with some Shark Server 2 or Spark Thrift Server builds. If the server-side properties do not take effect when the check box is clear, then select the check box.

6. To force the driver to convert server-side property key names to all lower case characters, select the Convert Key Name to Lower Case check box.

7. To save your settings and close the Server Side Properties dialog box, click OK.


Linux Driver

System Requirements

You install the Simba ODBC Driver with SQL Connector for Apache Spark on client computers accessing data in a Hadoop cluster with the Spark service installed and running. Each computer where you install the driver must meet the following minimum system requirements:

• One of the following distributions (32- and 64-bit editions are supported):
  ◦ Red Hat® Enterprise Linux® (RHEL) 5.0 or 6.0
  ◦ CentOS 5.0 or 6.0
  ◦ SUSE Linux Enterprise Server (SLES) 11

• 45 MB of available disk space
• One of the following ODBC driver managers installed:
  ◦ iODBC 3.52.7 or later
  ◦ unixODBC 2.3.0 or later

The driver is suitable for use with all versions of Spark.

Installing the Driver Using the RPM

There are two versions of the driver for Linux:
• SimbaSparkODBC-32bit-Version-Release.LinuxDistro.i686.rpm for the 32-bit driver
• SimbaSparkODBC-Version-Release.LinuxDistro.x86_64.rpm for the 64-bit driver

Version is the version number of the driver, and Release is the release number for this version of the driver. LinuxDistro is either el5 or el6. For SUSE, the LinuxDistro placeholder is empty.

The bitness of the driver that you select should match the bitness of the client application accessing your Hadoop-based data. For example, if the client application is 64-bit, then you should install the 64-bit driver. Note that 64-bit editions of Linux support both 32- and 64-bit applications. Verify the bitness of your intended application and install the appropriate version of the driver.

Ensure that you install the driver using the RPM corresponding to your Linux distribution.


The Simba ODBC Driver with SQL Connector for Apache Spark driver files are installed in the following directories:

• /opt/simba/sparkodbc contains release notes, the Simba ODBC Driver with SQL Connector for Apache Spark Installation and Configuration Guide in PDF format, and a Readme.txt file that provides plain text installation and configuration instructions.

• /opt/simba/sparkodbc/ErrorMessages contains error message files required by the driver.

• /opt/simba/sparkodbc/Setup contains sample configuration files named odbc.ini and odbcinst.ini.

• /opt/simba/sparkodbc/lib/32 contains the 32-bit shared libraries and the simba.sparkodbc.ini configuration file.

• /opt/simba/sparkodbc/lib/64 contains the 64-bit shared libraries and the simba.sparkodbc.ini configuration file.

To install the Simba ODBC Driver with SQL Connector for Apache Spark:
1. In Red Hat Enterprise Linux or CentOS, log in as the root user, then navigate to the folder containing the driver RPM packages to install, and then type the following at the command line, where RPMFileName is the file name of the RPM package containing the version of the driver that you want to install:
   yum --nogpgcheck localinstall RPMFileName
   OR
   In SUSE Linux Enterprise Server, log in as the root user, then navigate to the folder containing the driver RPM packages to install, and then type the following at the command line, where RPMFileName is the file name of the RPM package containing the version of the driver that you want to install:
   zypper install RPMFileName
2. If you received a license file via e-mail, then copy the license file into the /opt/simba/sparkodbc/lib/32 or /opt/simba/sparkodbc/lib/64 folder, depending on the version of the driver that you installed.
   To avoid security issues, you may need to save the license file on your local computer prior to copying the file into the folder.

The Simba ODBC Driver with SQL Connector for Apache Spark depends on the following resources:

• cyrus-sasl-2.1.22-7 or above
• cyrus-sasl-gssapi-2.1.22-7 or above
• cyrus-sasl-plain-2.1.22-7 or above

If the package manager in your Linux distribution cannot resolve the dependencies automatically when installing the driver, then download and manually install the packages required by the version of the driver that you want to install.


Installing the Driver Using the Tarball Package

Alternatively, the Simba ODBC Driver with SQL Connector for Apache Spark is available for installation using a TAR.GZ tarball package. The tarball package includes the following, where INSTALL_DIR is your chosen installation directory:

• INSTALL_DIR/simba/sparkodbc/ contains the release notes, the Simba ODBC Driver with SQL Connector for Apache Spark Installation and Configuration Guide in PDF format, and a Readme.txt file that provides plain text installation and configuration instructions.

• INSTALL_DIR/simba/sparkodbc/lib/32 contains the 32-bit driver and the simba.sparkodbc.ini configuration file.

• INSTALL_DIR/simba/sparkodbc/lib/64 contains the 64-bit driver and the simba.sparkodbc.ini configuration file.

• INSTALL_DIR/simba/sparkodbc/ErrorMessages contains error message files required by the driver.

• INSTALL_DIR/simba/sparkodbc/Setup contains sample configuration files named odbc.ini and odbcinst.ini.

If you received a license file via e-mail, then copy the license file into the INSTALL_DIR/simba/sparkodbc/lib/32 or INSTALL_DIR/simba/sparkodbc/lib/64 folder, depending on the version of the driver that you installed.

Verifying the Version Number

If you need to verify the version of the Simba ODBC Driver with SQL Connector for Apache Spark that is installed on your Linux machine, you can query the version number through the command-line interface.

To verify the version number:
At the command prompt, run the following command:
yum list | grep SimbaSparkODBC
OR
Run the following command:
rpm -qa | grep SimbaSparkODBC

The command returns information about the Simba ODBC Driver with SQL Connector for Apache Spark that is installed on your machine, including the version number.

Setting the LD_LIBRARY_PATH Environment Variable

The LD_LIBRARY_PATH environment variable must include the path to the installed ODBC driver manager libraries.

For example, if ODBC driver manager libraries are installed in /usr/local/lib, then set LD_LIBRARY_PATH as follows:


export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

For information about how to set environment variables permanently, refer to your Linux shell documentation.
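As one hedged example (assuming the bash shell; the /usr/local/lib path is carried over from the example above), you can append the export command to your shell profile so the setting persists across sessions:

```shell
# Persist LD_LIBRARY_PATH for future bash sessions by appending the
# export to ~/.bashrc. /usr/local/lib is an example path; substitute the
# directory where your ODBC driver manager libraries are installed.
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib' >> ~/.bashrc

# Apply the change to the current session as well.
. ~/.bashrc
```

Other shells (csh, zsh, and so on) use different profile files and syntax, so adjust accordingly.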

For information about creating ODBC connections using the Simba ODBC Driver with SQL Connector for Apache Spark, see Configuring ODBC Connections for Non-Windows Platforms on page 26.


Mac OS X Driver

System Requirements

You install the Simba ODBC Driver with SQL Connector for Apache Spark on client computers accessing data in a Hadoop cluster with the Spark service installed and running. Each computer where you install the driver must meet the following minimum system requirements:

• Mac OS X version 10.6.8 or later
• 100 MB of available disk space
• iODBC 3.52.7 or later

The driver is suitable for use with all versions of Spark. The driver supports both 32- and 64-bit client applications.

Installing the Driver

The Simba ODBC Driver with SQL Connector for Apache Spark driver files are installed in the following directories:

• /opt/simba/sparkodbc contains release notes and the Simba ODBC Driver with SQL Connector for Apache Spark Installation and Configuration Guide in PDF format.

• /opt/simba/sparkodbc/ErrorMessages contains error messages required by the driver.

• /opt/simba/sparkodbc/Setup contains sample configuration files named odbc.ini and odbcinst.ini.

• /opt/simba/sparkodbc/lib/universal contains the driver binaries and the simba.sparkodbc.ini configuration file.

To install the Simba ODBC Driver with SQL Connector for Apache Spark:
1. Double-click SimbaSparkODBC.dmg to mount the disk image.
2. Double-click SimbaSparkODBC.pkg to run the installer.
3. In the installer, click Continue.
4. On the Software License Agreement screen, click Continue, and when the prompt appears, click Agree if you agree to the terms of the License Agreement.
5. Optionally, to change the installation location, click Change Install Location, then select the desired location, and then click Continue.
6. To accept the installation location and begin the installation, click Install.
7. When the installation completes, click Close.
8. If you received a license file via e-mail, then copy the license file into the /opt/simba/sparkodbc/lib/universal folder.


To avoid security issues, you may need to save the license file on your local computer prior to copying the file into the folder.

Verifying the Version Number

If you need to verify the version of the Simba ODBC Driver with SQL Connector for Apache Spark that is installed on your Mac OS X machine, you can query the version number through the Terminal.

To verify the version number:
At the Terminal, run the following command:
pkgutil --info simba.sparkodbc

The command returns information about the Simba ODBC Driver with SQL Connector for Apache Spark that is installed on your machine, including the version number.

Setting the DYLD_LIBRARY_PATH Environment Variable

The DYLD_LIBRARY_PATH environment variable must include the path to the installed ODBC driver manager libraries.

For example, if ODBC driver manager libraries are installed in /usr/local/lib, then set DYLD_LIBRARY_PATH as follows:
export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:/usr/local/lib

For information about how to set environment variables permanently, refer to your Mac OS X shell documentation.

For information about creating ODBC connections using the Simba ODBC Driver with SQL Connector for Apache Spark, see Configuring ODBC Connections for Non-Windows Platforms on page 26.


Configuring ODBC Connections for Non-Windows Platforms

The following sections describe how to configure ODBC connections when using the Simba ODBC Driver with SQL Connector for Apache Spark with non-Windows platforms:

• Files on page 26
• Sample Files on page 27
• Configuring the Environment on page 27
• Configuring the odbc.ini File on page 28
• Configuring the odbcinst.ini File on page 29
• Configuring the simba.sparkodbc.ini File on page 30
• Configuring Authentication on page 31

Files

ODBC driver managers use configuration files to define and configure ODBC data sources and drivers. By default, the following configuration files residing in the user's home directory are used:

• .odbc.ini is used to define ODBC data sources, and it is required.
• .odbcinst.ini is used to define ODBC drivers, and it is optional.

Also, by default the Simba ODBC Driver with SQL Connector for Apache Spark is configured using the simba.sparkodbc.ini file, which is located in one of the following directories depending on the version of the driver that you are using:

• /opt/simba/sparkodbc/lib/32 for the 32-bit driver on Linux
• /opt/simba/sparkodbc/lib/64 for the 64-bit driver on Linux
• /opt/simba/sparkodbc/lib/universal for the driver on Mac OS X

The simba.sparkodbc.ini file is required.

The simba.sparkodbc.ini file in the /lib subfolder provides default settings for most configuration options available in the Simba ODBC Driver with SQL Connector for Apache Spark.

You can set driver configuration options in your odbc.ini and simba.sparkodbc.ini files. Configuration options set in a simba.sparkodbc.ini file apply to all connections, whereas configuration options set in an odbc.ini file are specific to a connection. Configuration options set in odbc.ini take precedence over configuration options set in simba.sparkodbc.ini. For information about the configuration options available for controlling the behavior of DSNs that are using the Simba ODBC Driver with SQL Connector for Apache Spark, see Appendix C Driver Configuration Options on page 47.


Sample Files

The driver installation contains the following sample configuration files in the Setup directory:

• odbc.ini
• odbcinst.ini

These sample configuration files provide preset values for settings related to the Simba ODBC Driver with SQL Connector for Apache Spark.

The names of the sample configuration files do not begin with a period (.) so that they will appear in directory listings by default. A filename beginning with a period (.) is hidden. For odbc.ini and odbcinst.ini, if the default location is used, then the filenames must begin with a period (.).

If the configuration files do not exist in the home directory, then you can copy the sample configuration files to the home directory, and then rename the files. If the configuration files already exist in the home directory, then use the sample configuration files as a guide to modify the existing configuration files.

Configuring the Environment

Optionally, you can use three environment variables—ODBCINI, ODBCSYSINI, and SIMBASPARKINI—to specify different locations for the odbc.ini, odbcinst.ini, and simba.sparkodbc.ini configuration files by doing the following:

• Set ODBCINI to point to your odbc.ini file.
• Set ODBCSYSINI to point to the directory containing the odbcinst.ini file.
• Set SIMBASPARKINI to point to your simba.sparkodbc.ini file.

For example, if your odbc.ini and simba.sparkodbc.ini files are located in /etc and your odbcinst.ini file is located in /usr/local/odbc, then set the environment variables as follows:
export ODBCINI=/etc/odbc.ini

export ODBCSYSINI=/usr/local/odbc

export SIMBASPARKINI=/etc/simba.sparkodbc.ini

The following search order is used to locate the simba.sparkodbc.ini file:
1. If the SIMBASPARKINI environment variable is defined, then the driver searches for the file specified by the environment variable.
   SIMBASPARKINI must specify the full path, including the file name.

2. The directory containing the driver's binary is searched for a file named simba.sparkodbc.ini not beginning with a period.

3. The current working directory of the application is searched for a file named simba.sparkodbc.ini not beginning with a period.


4. The directory ~/ (that is, $HOME) is searched for a hidden file named .simba.sparkodbc.ini.

5. The directory /etc is searched for a file named simba.sparkodbc.ini not beginning with a period.
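The search order above can be sketched as a small bash function (a hypothetical illustration only, not part of the driver; the DRIVER_LIB variable stands in for the full path to the driver's shared library):

```shell
# Sketch of the documented search order for simba.sparkodbc.ini (bash).
# DRIVER_LIB is a placeholder for the full path to the driver binary.
find_simba_spark_ini() {
    local candidates=()
    # 1. SIMBASPARKINI, if defined, must give the full path to the file.
    [ -n "$SIMBASPARKINI" ] && candidates+=("$SIMBASPARKINI")
    # 2. The directory containing the driver's binary.
    candidates+=("$(dirname "$DRIVER_LIB")/simba.sparkodbc.ini")
    # 3. The current working directory of the application.
    candidates+=("$PWD/simba.sparkodbc.ini")
    # 4. A hidden file in $HOME.
    candidates+=("$HOME/.simba.sparkodbc.ini")
    # 5. The system-wide /etc directory.
    candidates+=("/etc/simba.sparkodbc.ini")
    local f
    for f in "${candidates[@]}"; do
        [ -f "$f" ] && { printf '%s\n' "$f"; return 0; }
    done
    return 1
}
```

The first candidate that exists wins, which is why an explicitly set SIMBASPARKINI always takes precedence over the other locations.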

Configuring the odbc.ini File

ODBC Data Source Names (DSNs) are defined in the odbc.ini configuration file. The file is divided into several sections:

• [ODBC] is optional and used to control global ODBC configuration, such as ODBC tracing.

• [ODBC Data Sources] is required, listing DSNs and associating DSNs with a driver.

• A section having the same name as the data source specified in the [ODBC Data Sources] section is required to configure the data source.

The following is an example of an odbc.ini configuration file for Linux:
[ODBC Data Sources]

Sample Simba Spark DSN 32=Simba Spark ODBC Driver 32-bit

[Sample Simba Spark DSN 32]

Driver=/opt/simba/sparkodbc/lib/32/libsimbasparkodbc32.so

HOST=MySparkServer

PORT=10000

MySparkServer is the IP address or host name of the Spark server.

The following is an example of an odbc.ini configuration file for Mac OS X:
[ODBC Data Sources]

Sample Simba Spark DSN=Simba Spark ODBC Driver

[Sample Simba Spark DSN]

Driver=/opt/simba/sparkodbc/lib/universal/libsimbasparkodbc.dylib

HOST=MySparkServer

PORT=10000

MySparkServer is the IP address or host name of the Spark server.

To create a Data Source Name:
1. Open the .odbc.ini configuration file in a text editor.
2. In the [ODBC Data Sources] section, add a new entry by typing the Data Source Name (DSN), then an equal sign (=), and then the driver name.


3. In the .odbc.ini file, add a new section with a name that matches the DSN you specified in step 2, and then add configuration options to the section. Specify configuration options as key-value pairs.

Shark Server does not support authentication. Most default configurations of Shark Server 2 or Spark Thrift Server require User Name authentication, which you configure by setting the AuthMech key to 2. To verify the authentication mechanism that you need to use for your connection, check the configuration of your Hadoop / Spark distribution. For more information, see Authentication Options on page 40.

4. Save the .odbc.ini configuration file.
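For example, the sample Linux DSN shown earlier could be extended with User Name authentication like this (the UID value spark_user is a placeholder; substitute a user name valid for your cluster):

```ini
[Sample Simba Spark DSN 32]
Driver=/opt/simba/sparkodbc/lib/32/libsimbasparkodbc32.so
HOST=MySparkServer
PORT=10000
AuthMech=2
UID=spark_user
```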

For information about the configuration options available for controlling the behavior of DSNs that are using the Simba ODBC Driver with SQL Connector for Apache Spark, see Appendix C Driver Configuration Options on page 47.

Configuring the odbcinst.ini File

ODBC drivers are defined in the odbcinst.ini configuration file. The configuration file is optional because drivers can be specified directly in the odbc.ini configuration file, as described in Configuring the odbc.ini File on page 28.

The odbcinst.ini file is divided into the following sections:
• [ODBC Drivers] lists the names of all the installed ODBC drivers.
• A section having the same name as the driver name specified in the [ODBC Drivers] section lists driver attributes and values.

The following is an example of an odbcinst.ini configuration file for Linux:
[ODBC Drivers]

Simba Spark ODBC Driver 32-bit=Installed

Simba Spark ODBC Driver 64-bit=Installed

[Simba Spark ODBC Driver 32-bit]

Description=Simba Spark ODBC Driver (32-bit)

Driver=/opt/simba/sparkodbc/lib/32/libsimbasparkodbc32.so

[Simba Spark ODBC Driver 64-bit]

Description=Simba Spark ODBC Driver (64-bit)

Driver=/opt/simba/sparkodbc/lib/64/libsimbasparkodbc64.so

The following is an example of an odbcinst.ini configuration file for Mac OS X:
[ODBC Drivers]

Simba Spark ODBC Driver=Installed

[Simba Spark ODBC Driver]

Description=Simba Spark ODBC Driver


Driver=/opt/simba/sparkodbc/lib/universal/libsimbasparkodbc.dylib

To define a driver:
1. Open the .odbcinst.ini configuration file in a text editor.
2. In the [ODBC Drivers] section, add a new entry by typing the driver name, and then typing =Installed.
   Type a symbolic name that you want to use to refer to the driver in connection strings or DSNs.

3. In the .odbcinst.ini file, add a new section with a name that matches the driver name you typed in step 2, and then add configuration options to the section based on the sample odbcinst.ini file provided in the Setup directory. Specify configuration options as key-value pairs.

4. Save the .odbcinst.ini configuration file.

Configuring the simba.sparkodbc.ini File

The simba.sparkodbc.ini file contains configuration settings for the Simba ODBC Driver with SQL Connector for Apache Spark. Settings that you define in the simba.sparkodbc.ini file apply to all connections that use the driver.

To configure the Simba ODBC Driver with SQL Connector for Apache Spark to work with your ODBC driver manager:
1. Open the simba.sparkodbc.ini configuration file in a text editor.
2. Edit the DriverManagerEncoding setting. The value is usually UTF-16 or UTF-32, depending on the ODBC driver manager you use. iODBC uses UTF-32, and unixODBC uses UTF-16. To determine the correct setting to use, refer to your ODBC driver manager documentation.

3. Edit the ODBCInstLib setting. The value is the name of the ODBCInst shared library for the ODBC driver manager you use. To determine the correct library to specify, refer to your ODBC driver manager documentation.
   The configuration file defaults to the shared library for iODBC. In Linux, the shared library name for iODBC is libiodbcinst.so. In Mac OS X, the shared library name for iODBC is libiodbcinst.dylib.

You can specify an absolute or relative filename for the library. If you intend to use the relative filename, then the path to the library must be included in the library path environment variable. In Linux, the library path environment variable is named LD_LIBRARY_PATH. In Mac OS X, the library path environment variable is named DYLD_LIBRARY_PATH.

4. Save the simba.sparkodbc.ini configuration file.
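As a hedged illustration, a simba.sparkodbc.ini configured for unixODBC on Linux might contain entries like the following (the [Driver] section name follows the sample file shipped with the driver; verify against your copy):

```ini
[Driver]
# unixODBC passes Unicode data as UTF-16.
DriverManagerEncoding=UTF-16
# ODBCInst shared library for unixODBC; use an absolute path if the
# library's directory is not in LD_LIBRARY_PATH.
ODBCInstLib=libodbcinst.so
```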


Configuring Authentication

You can select the type of authentication to use for a connection by defining the AuthMech connection attribute in a connection string or in a DSN (in the odbc.ini file). Depending on the authentication mechanism you use, there may be additional connection attributes that you must define. For more information about the attributes involved in configuring authentication, see Appendix C Driver Configuration Options on page 47.

For information about selecting the appropriate authentication mechanism to use, see Appendix A Authentication Options on page 40.

Using No Authentication

When connecting to a Spark server of type Shark Server, you must use No Authentication.

To configure a connection without authentication:
Set the AuthMech connection attribute to 0.

Using Kerberos

Kerberos must be installed and configured before you can use this authentication mechanism. For more information, refer to the MIT Kerberos documentation.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.

To configure Kerberos authentication:
1. Set the AuthMech connection attribute to 1.
2. If your Kerberos setup does not define a default realm or if the realm of your Spark server is not the default, then set the appropriate realm using the KrbRealm attribute.
   OR
   To use the default realm defined in your Kerberos setup, do not set the KrbRealm attribute.

3. Set the KrbHostFQDN attribute to the fully qualified domain name of the Shark Server 2 or Spark Thrift Server host.
4. Set the KrbServiceName attribute to the service name of the Shark Server 2 or Spark Thrift Server.
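For example, a Kerberos-enabled DSN entry in odbc.ini might look like the following (host name, realm, and service name are placeholders; confirm the correct values for your cluster):

```ini
[Sample Spark Kerberos DSN]
Driver=/opt/simba/sparkodbc/lib/64/libsimbasparkodbc64.so
HOST=sparkserver.example.com
PORT=10000
AuthMech=1
KrbRealm=EXAMPLE.COM
KrbHostFQDN=sparkserver.example.com
KrbServiceName=spark
```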

Using User Name

This authentication mechanism requires a user name but does not require a password. The user name labels the session, facilitating database tracking.


This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions. Most default configurations of Shark Server 2 or Spark Thrift Server require User Name authentication.

To configure User Name authentication:
1. Set the AuthMech connection attribute to 2.
2. Set the UID attribute to an appropriate user name for accessing the Spark server.

Using User Name and Password

This authentication mechanism requires a user name and a password.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.

To configure User Name and Password authentication:
1. Set the AuthMech connection attribute to 3.
2. Set the UID attribute to an appropriate user name for accessing the Spark server.
3. Set the PWD attribute to the password corresponding to the user name you provided in step 2.

Using User Name and Password (SSL)

This authentication mechanism uses SSL and requires a user name and a password. The driver accepts self-signed SSL certificates for this authentication mechanism.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.

To configure User Name and Password (SSL) authentication:
1. Set the AuthMech connection attribute to 4.
2. Set the UID attribute to an appropriate user name for accessing the Spark server.
3. Set the PWD attribute to the password corresponding to the user name you provided in step 2.
4. Optionally, configure the driver to allow the common name of a CA-issued certificate to not match the host name of the Spark server by setting the CAIssuedCertNamesMismatch attribute to 1.
   For self-signed certificates, the driver always allows the common name of the certificate to not match the host name.
5. To configure the driver to load SSL certificates from a specific PEM file, set the TrustedCerts attribute to the path of the file.
   OR
   To use the trusted CA certificates PEM file that is installed with the driver, do not specify a value for the TrustedCerts attribute.


Using Windows Azure HDInsight Emulator

This authentication mechanism enables you to connect to a Spark server on Windows Azure HDInsight Emulator.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on an HDInsight distribution.

To configure a connection to a Spark server on Windows Azure HDInsight Emulator:
1. Set the AuthMech connection attribute to 5.
2. Set the HTTPPath attribute to the partial URL corresponding to the Spark server.
3. Set the UID attribute to an appropriate user name for accessing the Spark server.
4. Set the PWD attribute to the password corresponding to the user name you provided in step 3.

Using Windows Azure HDInsight Service

This authentication mechanism enables you to connect to a Spark server on Windows Azure HDInsight Service. The driver does not accept self-signed SSL certificates for this authentication mechanism. Also, the common name of the CA-issued certificate must match the host name of the Spark server.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on an HDInsight distribution.

To configure a connection to a Spark server on Windows Azure HDInsight Service:
1. Set the AuthMech connection attribute to 6.
2. Set the HTTPPath attribute to the partial URL corresponding to the Spark server.
3. Set the UID attribute to an appropriate user name for accessing the Spark server.
4. Set the PWD attribute to the password corresponding to the user name you typed in step 3.
5. To configure the driver to load SSL certificates from a specific file, set the TrustedCerts attribute to the path of the file.
   OR
   To use the trusted CA certificates PEM file that is installed with the driver, do not specify a value for the TrustedCerts attribute.

Using HTTP

This authentication mechanism enables you to connect to a Shark Server 2 or Spark Thrift Server running in HTTP mode.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.


To configure HTTP authentication:
1. Set the AuthMech connection attribute to 7.
2. Set the HTTPPath attribute to the partial URL corresponding to the Spark server.
3. Set the UID attribute to an appropriate user name for accessing the Spark server.
4. Set the PWD attribute to the password corresponding to the user name you typed in step 3.

Using HTTPS

This authentication mechanism enables you to connect to a Shark Server 2 or Spark Thrift Server that is running in HTTP mode and has SSL enabled. The driver accepts self-signed SSL certificates for this authentication mechanism.

This authentication mechanism is available only for Shark Server 2 or Spark Thrift Server on non-HDInsight distributions.

To configure HTTPS authentication:
1. Set the AuthMech connection attribute to 8.
2. Set the HTTPPath attribute to the partial URL corresponding to the Spark server.
3. Set the UID attribute to an appropriate user name for accessing the Spark server.
4. Set the PWD attribute to the password corresponding to the user name you typed in step 3.
5. Optionally, configure the driver to allow the common name of a CA-issued certificate to not match the host name of the Spark server by setting the CAIssuedCertNamesMismatch attribute to 1.
   For self-signed certificates, the driver always allows the common name of the certificate to not match the host name.
6. To configure the driver to load SSL certificates from a specific file, set the TrustedCerts attribute to the path of the file.
   OR
   To use the trusted CA certificates PEM file that is installed with the driver, do not specify a value for the TrustedCerts attribute.

The HTTPS authentication method can also be used to connect to Shark Server 2 or Spark Thrift Server via the Knox gateway. To determine what user credentials to use and what value to set for the HTTPPath attribute, refer to the Knox documentation.


Features

More information is provided on the following features of the Simba ODBC Driver with SQL Connector for Apache Spark:

- SQL Query versus Spark SQL Query on page 35
- SQL Connector on page 35
- Data Types on page 35
- Catalog and Schema Support on page 36
- spark_system Table on page 37
- Server-Side Properties on page 37
- Get Tables With Query on page 37

SQL Query versus Spark SQL Query

The native query language supported by Spark is Spark SQL. For simple queries, Spark SQL is a subset of SQL-92. However, the syntax is different enough that most applications do not work with native Spark SQL.

SQL Connector

To bridge the difference between SQL and Spark SQL, the SQL Connector feature translates standard SQL-92 queries into equivalent Spark SQL queries. The SQL Connector performs syntactical translations and structural transformations. For example:

- Quoted Identifiers — The double quotes (") that SQL uses to quote identifiers are translated into back quotes (`) to match Spark SQL syntax. The SQL Connector needs to handle this translation because even when a driver reports the back quote as the quote character, some applications still generate double-quoted identifiers.
- Table Aliases — Support is provided for the AS keyword between a table reference and its alias, which Spark SQL normally does not support.
- JOIN, INNER JOIN, and CROSS JOIN — SQL JOIN, INNER JOIN, and CROSS JOIN syntax is translated to Spark SQL JOIN syntax.
- TOP N/LIMIT — SQL TOP N queries are transformed to Spark SQL LIMIT queries.
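Two of the translations above can be illustrated with a toy rewriter. This is not the driver's actual implementation, only a sketch of the kind of transformation described: double-quoted identifiers become backquoted, and TOP N becomes LIMIT N.

```python
import re

def to_spark_sql(sql: str) -> str:
    """Toy illustration of two SQL-92 -> Spark SQL rewrites."""
    # "identifier" -> `identifier`
    sql = re.sub(r'"([^"]+)"', r'`\1`', sql)
    # SELECT TOP n ... -> SELECT ... LIMIT n
    match = re.match(r'(?i)SELECT\s+TOP\s+(\d+)\s+(.*)', sql)
    if match:
        n, rest = match.groups()
        sql = f'SELECT {rest} LIMIT {n}'
    return sql

print(to_spark_sql('SELECT TOP 10 "name" FROM "people"'))
# -> SELECT `name` FROM `people` LIMIT 10
```

A real translator also handles aliasing and JOIN syntax; this sketch only shows the flavor of the rewrites.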

Data Types

The Simba ODBC Driver with SQL Connector for Apache Spark supports many common data formats, converting between Spark data types and SQL data types.

Table 1 lists the supported data type mappings.


Spark Type      SQL Type
TINYINT         SQL_TINYINT
SMALLINT        SQL_SMALLINT
INT             SQL_INTEGER
BIGINT          SQL_BIGINT
FLOAT           SQL_REAL
DOUBLE          SQL_DOUBLE
DECIMAL         SQL_DECIMAL
BOOLEAN         SQL_BIT
STRING          SQL_VARCHAR
TIMESTAMP       SQL_TYPE_TIMESTAMP
VARCHAR(n)      SQL_VARCHAR
DATE            SQL_TYPE_DATE
DECIMAL(p,s)    SQL_DECIMAL
CHAR(n)         SQL_CHAR
BINARY          SQL_VARBINARY

Table 1. Supported Data Types

The aggregate types (ARRAY, MAP, and STRUCT) are not yet supported. Columns of aggregate types are treated as STRING columns.
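For quick reference, the mapping in Table 1 can be expressed as a dictionary. The Python below simply restates the table; parameterized types are shown with their type names only.

```python
# Table 1 restated: Spark type name -> ODBC SQL type identifier.
SPARK_TO_SQL_TYPE = {
    "TINYINT": "SQL_TINYINT",
    "SMALLINT": "SQL_SMALLINT",
    "INT": "SQL_INTEGER",
    "BIGINT": "SQL_BIGINT",
    "FLOAT": "SQL_REAL",
    "DOUBLE": "SQL_DOUBLE",
    "DECIMAL": "SQL_DECIMAL",
    "BOOLEAN": "SQL_BIT",
    "STRING": "SQL_VARCHAR",
    "TIMESTAMP": "SQL_TYPE_TIMESTAMP",
    "VARCHAR(n)": "SQL_VARCHAR",
    "DATE": "SQL_TYPE_DATE",
    "DECIMAL(p,s)": "SQL_DECIMAL",
    "CHAR(n)": "SQL_CHAR",
    "BINARY": "SQL_VARBINARY",
}

# Aggregate types (ARRAY, MAP, STRUCT) are not listed: the driver treats
# columns of those types as STRING, which maps to SQL_VARCHAR.
```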

Catalog and Schema Support

The Simba ODBC Driver with SQL Connector for Apache Spark supports both catalogs and schemas in order to make it easy for the driver to work with various ODBC applications. Since Spark only organizes tables into schemas/databases, the driver provides a synthetic catalog called “SPARK” under which all of the schemas/databases are organized. The driver also maps the ODBC schema to the Spark schema/database.


spark_system Table

A pseudo-table called spark_system can be used to query for Spark cluster system environment information. The pseudo-table is under the pseudo-schema called spark_system. The table has two STRING type columns, envkey and envvalue. Standard SQL can be executed against the spark_system table. For example:
SELECT * FROM SPARK.spark_system.spark_system WHERE envkey LIKE '%spark%'

The above query returns all of the Spark system environment entries whose key contains the word “spark.” A special query, set -v, is executed to fetch system environment information. Some versions of Spark do not support this query. For versions of Spark that do not support querying system environment information, the driver returns an empty result set.

Server-Side Properties

The Simba ODBC Driver with SQL Connector for Apache Spark allows you to set server-side properties via a DSN. Server-side properties specified in a DSN affect only the connection that is established using the DSN.

For more information about setting server-side properties when using the Windows driver, see Configuring Server-Side Properties on page 18. For information about setting server-side properties when using the driver on a non-Windows platform, see Driver Configuration Options on page 47.

Get Tables With Query

The Get Tables With Query configuration option allows you to choose whether to use the SHOW TABLES query or the GetTables API call to retrieve table names from a database.

Shark Server 2 and Spark Thrift Server both have a limit on the number of tables that can be in a database when handling the GetTables API call. When the number of tables in a database is above the limit, the API call returns a stack overflow error or a timeout error. The exact limit and the error that appears depend on the JVM settings.

As a workaround for this issue, enable the Get Tables With Query configuration option (or the GetTablesWithQuery key) to use the query instead of the API call.
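On Linux, this workaround is set in the DSN definition. The following odbc.ini fragment is a sketch only; the DSN name, host, and driver path are placeholders, not values from this guide.

```ini
[Spark DSN]
; Illustrative entries only; your driver path and server details differ.
Driver=/path/to/libsparkodbc.so
Host=spark-host.example.com
Port=10000
SparkServerType=3
; Use the SHOW TABLES query instead of the GetTables Thrift API call:
GetTablesWithQuery=1
```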


Known Issues in Spark

The following are known issues in Apache Spark that you might encounter while using the driver.

Backquotes in Aliases Are Not Handled Correctly

In Spark 1.0.x and 1.1.x, the backquotes (`) surrounding identifiers are returned as part of the column names in the result set metadata. Backquotes are used to quote identifiers in Spark SQL and should not be considered part of the identifier.

This issue is fixed in Spark 1.2.x.

For more information, see the JIRA issue posted by Apache named Backticks Aren't Handled Correctly in Aliases at https://issues.apache.org/jira/browse/SPARK-3708

Filtering on TIMESTAMP Columns Does Not Return Any Rows

In Spark 0.9.x, 1.0.x, and 1.1.0, using a WHERE clause to filter TIMESTAMP columns does not return any rows.

This issue is fixed in Spark 1.1.1 and later.

For more information, see the JIRA issue posted by Apache named Timestamp Support in the Parser at https://issues.apache.org/jira/browse/SPARK-3173

Cannot Use AND to Combine a TIMESTAMP Column Filter with Another Filter

In Spark 1.1.x, when you execute a query that uses the AND operator to combine a TIMESTAMP column filter with another filter, an error occurs.

As a workaround, use a subquery as shown in the following example:
SELECT * FROM (SELECT * FROM timestamp_table WHERE (keycolumn='TimestampMicroSeconds')) s1 WHERE (column1 = '1955-10-11 11:10:33.123456');


Contact Us

If you have difficulty using the driver, please contact our Technical Support staff. We welcome your questions, comments, and feature requests.

Technical Support is available Monday to Friday from 8 a.m. to 5 p.m. Pacific Time.

To help us assist you, prior to contacting Technical Support please prepare a detailed summary of the client and server environment, including operating system version, patch level, and configuration.

You can contact Technical Support via:
- E-mail: [email protected]
- Web site: www.simba.com
- Telephone: (604) 633-0008 Extension 3
- Fax: (604) 633-0004

You can also follow us on Twitter @SimbaTech


Appendix A Authentication Options

Shark Server does not support authentication. You must select No Authentication as the authentication mechanism.

Shark Server 2 or Spark Thrift Server on an HDInsight distribution supports the following authentication mechanisms:

- Windows Azure HDInsight Emulator
- Windows Azure HDInsight Service

Shark Server 2 or Spark Thrift Server on a non-HDInsight distribution supports the following authentication mechanisms:

- No Authentication
- Kerberos
- User Name
- User Name and Password
- User Name and Password (SSL)
- HTTP
- HTTPS

Most default configurations of Shark Server 2 or Spark Thrift Server on non-HDInsight distributions require User Name authentication. If you are unable to connect to your Spark server using User Name authentication, then verify the authentication mechanism configured for your Spark server by examining the hive-site.xml file. Examine the following properties to determine which authentication mechanism your server is set to use:

- hive.server2.authentication
- hive.server2.enable.doAs

Table 2 lists the authentication mechanisms to configure for the driver based on the settings in the hive-site.xml file.

hive.server2.authentication   hive.server2.enable.doAs   Driver Authentication Mechanism
NOSASL                        False                      No Authentication
KERBEROS                      True or False              Kerberos
NONE                          True or False              User Name
LDAP                          True or False              User Name and Password, or User Name and Password (SSL)

Table 2. Spark Authentication Mechanism Configurations

User Name and Password (SSL) can only be used if your Spark server is configured with SSL.

It is an error to set hive.server2.authentication to NOSASL and hive.server2.enable.doAs to true. This configuration will not prevent the service from starting up, but results in an unusable service.
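Table 2, together with the NOSASL restriction above, amounts to a small lookup. The sketch below restates it in Python for clarity; it is an illustration, not part of the driver.

```python
def driver_auth_mechanism(authentication: str, enable_do_as: bool) -> str:
    """Map hive-site.xml settings to the driver authentication mechanism."""
    if authentication == "NOSASL":
        if enable_do_as:
            # Called out above: the service starts but is unusable.
            raise ValueError("NOSASL with doAs=true yields an unusable service")
        return "No Authentication"
    if authentication == "KERBEROS":
        return "Kerberos"             # doAs may be true or false
    if authentication == "NONE":
        return "User Name"            # doAs may be true or false
    if authentication == "LDAP":
        # If the server is configured with SSL, use User Name and Password (SSL).
        return "User Name and Password"
    raise ValueError(f"unrecognized setting: {authentication}")

print(driver_auth_mechanism("NONE", True))
```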

For more information about authentication mechanisms, refer to the documentation for your Hadoop/Spark distribution. See also the topic Running Hadoop in Secure Mode at http://hadoop.apache.org/docs/r0.23.7/hadoop-project-dist/hadoop-common/ClusterSetup.html#Running_Hadoop_in_Secure_Mode

Using No Authentication

When hive.server2.authentication is set to NOSASL, you must configure your connection to use No Authentication.

Using Kerberos

When connecting to a Spark server of type Shark Server 2 or Spark Thrift Server on a non-HDInsight distribution and hive.server2.authentication is set to KERBEROS, you must configure your connection to use Kerberos authentication.

Using User Name

When connecting to a Spark server of type Shark Server 2 or Spark Thrift Server on a non-HDInsight distribution and hive.server2.authentication is set to NONE, you must configure your connection to use User Name authentication. Validation of the credentials that you include depends on hive.server2.enable.doAs:

- If hive.server2.enable.doAs is set to true, then the user name in the DSN or driver configuration must be an existing OS user on the host that is running Shark Server 2 or Spark Thrift Server.
- If hive.server2.enable.doAs is set to false, then the user name in the DSN or driver configuration is ignored.

If the user name is not specified in the DSN or driver configuration, then the driver defaults to using “anonymous” as the user name.

Using User Name and Password

When connecting to a Spark server of type Shark Server 2 or Spark Thrift Server on a non-HDInsight distribution and the server is configured to use the SASL-PLAIN authentication mechanism with a user name and a password, you must configure your connection to use User Name and Password authentication.

Using User Name and Password (SSL)

When connecting to a Spark server of type Shark Server 2 or Spark Thrift Server on a non-HDInsight distribution and the server is configured to use SSL and the SASL-PLAIN authentication mechanism with a user name and a password, you must configure your connection to use User Name and Password (SSL) authentication.

Using Windows Azure HDInsight Emulator

When connecting to a Spark server on Windows Azure HDInsight Emulator, you must configure your connection to use Windows Azure HDInsight Emulator.

Using Windows Azure HDInsight Service

When connecting to a Spark server on Windows Azure HDInsight Service, you must configure your connection to use Windows Azure HDInsight Service.

Using HTTP

When connecting to a Spark server of type Shark Server 2 or Spark Thrift Server on a non-HDInsight distribution and the server is configured to use the Thrift HTTP transport over a TCP socket, you must configure your connection to use HTTP authentication.

Using HTTPS

When connecting to a Spark server of type Shark Server 2 or Spark Thrift Server on a non-HDInsight distribution and the server is configured to use the Thrift HTTP transport over an SSL socket, you must configure your connection to use HTTPS authentication.


Appendix B Configuring Kerberos Authentication for Windows

MIT Kerberos

Downloading and Installing MIT Kerberos for Windows 4.0.1

For information about Kerberos and download links for the installer, see the MIT Kerberos website at http://web.mit.edu/kerberos/

To download and install MIT Kerberos for Windows 4.0.1:
1. To download the Kerberos installer for 64-bit computers, use the following download link from the MIT Kerberos website: http://web.mit.edu/kerberos/dist/kfw/4.0/kfw-4.0.1-amd64.msi
   OR
   To download the Kerberos installer for 32-bit computers, use the following download link from the MIT Kerberos website: http://web.mit.edu/kerberos/dist/kfw/4.0/kfw-4.0.1-i386.msi

   The 64-bit installer includes both 32-bit and 64-bit libraries. The 32-bit installer includes 32-bit libraries only.

2. To run the installer, double-click the .msi file that you downloaded in step 1.
3. Follow the instructions in the installer to complete the installation process.
4. When the installation completes, click Finish.

Setting Up the Kerberos Configuration File

Settings for Kerberos are specified through a configuration file. You can set up the configuration file as a .INI file in the default location—the C:\ProgramData\MIT\Kerberos5 directory—or as a .CONF file in a custom location.

Normally, the C:\ProgramData\MIT\Kerberos5 directory is hidden. For information about viewing and using this hidden directory, refer to the Microsoft Windows documentation.

For more information on configuring Kerberos, refer to the MIT Kerberos documentation.

To set up the Kerberos configuration file in the default location:
1. Obtain a krb5.conf configuration file from your Kerberos administrator.
   OR
   Obtain the configuration file from the /etc/krb5.conf folder on the computer that is hosting the Shark Server 2 or Spark Thrift Server instance.
2. Rename the configuration file from krb5.conf to krb5.ini.


3. Copy the krb5.ini file to the C:\ProgramData\MIT\Kerberos5 directory and overwrite the empty sample file.

To set up the Kerberos configuration file in a custom location:
1. Obtain a krb5.conf configuration file from your Kerberos administrator.
   OR
   Obtain the configuration file from the /etc/krb5.conf folder on the computer that is hosting the Shark Server 2 or Spark Thrift Server instance.
2. Place the krb5.conf file in an accessible directory and make note of the full path name.
3. Click the Start button, then right-click Computer, and then click Properties.
4. Click Advanced System Settings.
5. In the System Properties dialog box, click the Advanced tab and then click Environment Variables.
6. In the Environment Variables dialog box, under the System variables list, click New.
7. In the New System Variable dialog box, in the Variable name field, type KRB5_CONFIG.
8. In the Variable value field, type the absolute path to the krb5.conf file from step 2.
9. Click OK to save the new variable.
10. Ensure that the variable is listed in the System variables list.
11. Click OK to close the Environment Variables dialog box, and then click OK to close the System Properties dialog box.

Setting Up the Kerberos Credential Cache File

Kerberos uses a credential cache to store and manage credentials.

To set up the Kerberos credential cache file:
1. Create a directory where you want to save the Kerberos credential cache file. For example, create a directory named C:\temp.
2. Click the Start button, then right-click Computer, and then click Properties.
3. Click Advanced System Settings.
4. In the System Properties dialog box, click the Advanced tab and then click Environment Variables.
5. In the Environment Variables dialog box, under the System variables list, click New.
6. In the New System Variable dialog box, in the Variable name field, type KRB5CCNAME.
7. In the Variable value field, type the path to the folder you created in step 1, and then append the file name krb5cache. For example, if you created the folder C:\temp in step 1, then type C:\temp\krb5cache.

   krb5cache is a file (not a directory) that is managed by the Kerberos software, and it should not be created by the user. If you receive a permission error when you first use Kerberos, ensure that the krb5cache file does not already exist as a file or a directory.

8. Click OK to save the new variable.
9. Ensure that the variable appears in the System variables list.
10. Click OK to close the Environment Variables dialog box, and then click OK to close the System Properties dialog box.
11. To ensure that Kerberos uses the new settings, restart your computer.
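The effect of the KRB5CCNAME variable can be sketched as an environment lookup. The fallback path below is a placeholder for illustration, not the documented MIT Kerberos default resolution order.

```python
import os

def credential_cache_path(default=r"C:\temp\krb5cache") -> str:
    """Resolve the credential cache path, honoring KRB5CCNAME when set."""
    return os.environ.get("KRB5CCNAME", default)

# Simulate the environment variable configured in the steps above.
os.environ["KRB5CCNAME"] = r"C:\temp\krb5cache"
print(credential_cache_path())
```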

Obtaining a Ticket for a Kerberos Principal

A principal refers to a user or service that can authenticate to Kerberos. To authenticate to Kerberos, a principal must obtain a ticket by using a password or a keytab file. You can specify a keytab file to use, or use the default keytab file of your Kerberos configuration.

To obtain a ticket for a Kerberos principal using a password:
1. Click the Start button, then click All Programs, and then click the Kerberos for Windows (64-bit) or Kerberos for Windows (32-bit) program group.
2. Click MIT Kerberos Ticket Manager.
3. In the MIT Kerberos Ticket Manager, click Get Ticket.
4. In the Get Ticket dialog box, type your principal name and password, and then click OK.

If the authentication succeeds, then your ticket information appears in the MIT Kerberos Ticket Manager.

To obtain a ticket for a Kerberos principal using a keytab file:
1. Click the Start button, then click All Programs, then click Accessories, and then click Command Prompt.
2. In the Command Prompt, type a command using the following syntax:
   kinit -k -t keytab_path principal
   keytab_path is the full path to the keytab file. For example: C:\mykeytabs\myUser.keytab
   principal is the Kerberos user principal to use for authentication. For example: [email protected]
3. If the cache location KRB5CCNAME is not set or used, then use the -c option of the kinit command to specify the location of the credential cache. In the command, the -c argument must appear last. For example:
   kinit -k -t C:\mykeytabs\myUser.keytab [email protected] -c C:\ProgramData\MIT\krbcache
   krbcache is the Kerberos cache file, not a directory.


To obtain a ticket for a Kerberos principal using the default keytab file:

For information about configuring a default keytab file for your Kerberos configuration, refer to the MIT Kerberos documentation.

1. Click the Start button, then click All Programs, then click Accessories, and then click Command Prompt.
2. In the Command Prompt, type a command using the following syntax:
   kinit -k principal
   principal is the Kerberos user principal to use for authentication. For example: [email protected]
3. If the cache location KRB5CCNAME is not set or used, then use the -c option of the kinit command to specify the location of the credential cache. In the command, the -c argument must appear last. For example:
   kinit -k -t C:\mykeytabs\myUser.keytab [email protected] -c C:\ProgramData\MIT\krbcache
   krbcache is the Kerberos cache file, not a directory.


Appendix C Driver Configuration Options

This appendix lists the configuration options available in the Simba ODBC Driver with SQL Connector for Apache Spark alphabetically by field or button label. Options having only key names—not appearing in the user interface of the driver—are listed alphabetically by key name.

When creating or configuring a connection from a Windows computer, the fields and buttons are available in the Simba Spark ODBC Driver Configuration tool and the following dialog boxes:

- Simba Spark ODBC Driver DSN Setup
- Advanced Options
- Server Side Properties

When using a connection string or configuring a connection from a Linux or Mac computer, use the key names provided.

You can pass in configuration options in your connection string or set them in your odbc.ini and simba.sparkodbc.ini files. Configuration options set in a simba.sparkodbc.ini file apply to all connections, whereas configuration options passed in the connection string or set in an odbc.ini file are specific to a connection. Configuration options passed in using the connection string take precedence over configuration options set in odbc.ini. Configuration options set in odbc.ini take precedence over configuration options set in simba.sparkodbc.ini.
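The precedence order can be sketched as a dictionary merge. The option values below are placeholders for illustration; only the merge order reflects the rule described.

```python
# Lowest to highest precedence: simba.sparkodbc.ini < odbc.ini < connection string.
simba_sparkodbc_ini = {"RowsFetchedPerBlock": "10000", "FastSQLPrepare": "0"}
odbc_ini = {"FastSQLPrepare": "1"}
connection_string = {"RowsFetchedPerBlock": "5000"}

# Later dicts win, so sources are listed from lowest to highest precedence.
effective = {**simba_sparkodbc_ini, **odbc_ini, **connection_string}
print(effective)
```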

Configuration Options Appearing in the User Interface

The following configuration options are accessible via the Windows user interface for the Simba ODBC Driver with SQL Connector for Apache Spark, or via the key name when using a connection string or configuring a connection from a Linux or Mac computer:

- Allow Common Name Host Name Mismatch on page 48
- Apply Properties with Queries on page 48
- Async Exec Poll Interval on page 49
- Binary Column Length on page 49
- Convert Key Name to Lower Case on page 49
- Database on page 50
- Decimal Column Scale on page 50
- Default String Column Length on page 50
- Driver Config Take Precedence on page 50
- Fast SQLPrepare on page 51
- Get Tables With Query on page 51
- Host on page 51
- Host FQDN on page 52
- HTTP Path on page 52
- Mechanism on page 52
- Password on page 53
- Port on page 54
- Realm on page 54
- Rows Fetched Per Block on page 54
- Service Name on page 55
- Show System Table on page 55
- Spark Server Type on page 55
- Trusted Certificates on page 56
- Unicode SQL Character Types on page 56
- Use Async Exec on page 57
- Use Native Query on page 57
- User Name on page 58

Allow Common Name Host Name Mismatch

Key Name Default Value Required

CAIssuedCertNamesMismatch Clear (0) No

Description

When this option is enabled (1), the driver allows a CA-issued SSL certificate name to not match the host name of the Spark server.

When this option is disabled (0), the CA-issued SSL certificate name must match the host name of the Spark server.

This option is applicable only to the User Name and Password (SSL) and HTTPS authentication mechanisms.

Apply Properties with Queries

Key Name Default Value Required

ApplySSPWithQueries Selected (1) No

Description

When this option is enabled (1), the driver applies each server-side property by executing a set SSPKey=SSPValue query when opening a session to the Spark server.

When this option is disabled (0), the driver uses a more efficient method for applying server-side properties that does not involve additional network round-tripping. However, some Shark Server 2 or Spark Thrift Server builds are not compatible with the more efficient method.

When connecting to a Shark Server instance, ApplySSPWithQueries is always enabled.

www.simba.com 48

Simba ODBC Driver with SQL Con-nector for Apache Spark Installation and Configuration Guide

Async Exec Poll Interval

Key Name Default Value Required

AsyncExecPollInterval 100 No

Description

The time in milliseconds between each poll for the query execution status.

“Asynchronous execution” refers to the fact that the RPC call used to execute a query against Spark is asynchronous. It does not mean that ODBC asynchronous operations are supported.

This option is applicable only to HDInsight clusters.

Binary Column Length

Key Name Default Value Required

BinaryColumnLength 32767 No

Description

The maximum data length for BINARY columns.

By default, the column metadata for Spark does not specify a maximum data length for BINARY columns.

Convert Key Name to Lower Case

Key Name Default Value Required

LCaseSspKeyName Selected (1) No

Description

When this option is enabled (1), the driver converts server-side property key names to all lower case characters.

When this option is disabled (0), the driver does not modify the server-side property key names.

www.simba.com 49

Simba ODBC Driver with SQL Con-nector for Apache Spark Installation and Configuration Guide

Database

Key Name Default Value Required

Schema default No

Description

The name of the database schema to use when a schema is not explicitly specified in a query. You can still issue queries on other schemas by explicitly specifying the schema in the query.

To inspect your databases and determine the appropriate schema to use, type the show databases command at the Spark command prompt.

Decimal Column Scale

Key Name Default Value Required

DecimalColumnScale 10 No

Description

The maximum number of digits to the right of the decimal point for numeric data types.

Default String Column Length

Key Name Default Value Required

DefaultStringColumnLength 255 No

Description

The maximum data length for STRING columns.

By default, the column metadata for Spark does not specify a maximum data length for STRING columns.

Driver Config Take Precedence

Key Name Default Value Required

DriverConfigTakePrecedence Clear (0) No

www.simba.com 50

Simba ODBC Driver with SQL Con-nector for Apache Spark Installation and Configuration Guide

Description

When this option is enabled (1), driver-wide configurations take precedence over connection and DSN settings.

When this option is disabled (0), connection and DSN settings take precedence instead.

Fast SQLPrepare

Key Name Default Value Required

FastSQLPrepare Clear (0) No

Description

When this option is enabled (1), the driver defers query execution to SQLExecute.

When this option is disabled (0), the driver does not defer query execution to SQLExecute.

When using Native Query mode, the driver executes the Spark SQL query to retrieve the result set metadata for SQLPrepare. As a result, SQLPrepare might be slow. If the result set metadata is not required after calling SQLPrepare, then enable Fast SQLPrepare.

Get Tables With Query

Key Name Default Value Required

GetTablesWithQuery Clear (0) No

Description

When this option is enabled (1), the driver uses the SHOW TABLES query to retrieve the names of the tables in a database.

When this option is disabled (0), the driver uses the GetTables Thrift API call to retrieve the names of the tables in a database.

This option is applicable only when connecting to a Shark Server 2 or Spark Thrift Server instance.

Host

Key Name Default Value Required

HOST None Yes

www.simba.com 51

Simba ODBC Driver with SQL Con-nector for Apache Spark Installation and Configuration Guide

Description

The IP address or host name of the Spark server.

Host FQDN

Key Name Default Value Required

KrbHostFQDN None Yes, if the authentication mechanism is Kerberos

Description

The fully qualified domain name of the Shark Server 2 or Spark Thrift Server host.

HTTP Path

Key Name Default Value Required

HTTPPath None Yes, if the authentication mechanism is one of the following:
- Windows Azure HDInsight Emulator
- Windows Azure HDInsight Service
- HTTP
- HTTPS

Description

The partial URL corresponding to the Spark server, used with the HDInsight, HTTP, and HTTPS authentication mechanisms.

Mechanism

Key Name Default Value Required

AuthMech No Authentication (0) No

Description

The authentication mechanism to use.


Select one of the following settings, or set the key to the corresponding number:
- No Authentication (0)
- Kerberos (1)
- User Name (2)
- User Name and Password (3)
- User Name and Password (SSL) (4)
- Windows Azure HDInsight Emulator (5)
- Windows Azure HDInsight Service (6)
- HTTP (7)
- HTTPS (8)
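The settings above restated as a name-to-key lookup, for use when building connection strings programmatically; this is a convenience illustration, not part of the driver.

```python
# AuthMech setting name -> numeric key value, per the list above.
AUTH_MECH = {
    "No Authentication": 0,
    "Kerberos": 1,
    "User Name": 2,
    "User Name and Password": 3,
    "User Name and Password (SSL)": 4,
    "Windows Azure HDInsight Emulator": 5,
    "Windows Azure HDInsight Service": 6,
    "HTTP": 7,
    "HTTPS": 8,
}

print(f"AuthMech={AUTH_MECH['Kerberos']}")
```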

Password

Key Name Default Value Required

PWD None Yes, if the authentication mechanism is one of the following:
- User Name and Password
- User Name and Password (SSL)
- Windows Azure HDInsight Service
- HTTP
- HTTPS

Description

The password corresponding to the user name that you provided in the User Name field (the UID key).


Port

Key Name Default Value Required

PORT 10000 (non-HDInsight clusters), 10001 (Windows Azure HDInsight Emulator), 443 (Windows Azure HDInsight Service) Yes

Description

The number of the TCP port on which the Spark server is listening.

Realm

Key Name Default Value Required

KrbRealm Depends on your Kerberos configuration No

Description

The realm of the Shark Server 2 or Spark Thrift Server host.

If your Kerberos configuration already defines the realm of the Shark Server 2 or Spark Thrift Server host as the default realm, then you do not need to configure this option.

Rows Fetched Per Block

Key Name Default Value Required

RowsFetchedPerBlock 10000 No

Description

The maximum number of rows that a query returns at a time.

Any positive 32-bit integer is a valid value, but testing has shown that performance gains are marginal beyond the default value of 10000 rows.


Service Name

Key Name Default Value Required

KrbServiceName None Yes, if the authentication mechanism is Kerberos

Description

The Kerberos service principal name of the Spark server.

Show System Table

Key Name Default Value Required

ShowSystemTable Clear (0) No

Description

When this option is enabled (1), the driver returns the spark_system table for catalog function calls such as SQLTables and SQLColumns.

When this option is disabled (0), the driver does not return the spark_system table for catalog function calls.

Spark Server Type

Key Name Default Value Required

SparkServerType Shark Server (1) No

Description

Select Shark Server or set the key to 1 if you are connecting to a Shark Server instance.

Select Shark Server 2 or set the key to 2 if you are connecting to a Shark Server 2 instance.

Select Spark Thrift Server or set the key to 3 if you are connecting to a Spark Thrift Server instance.


Trusted Certificates

Key Name Default Value Required

TrustedCerts The cacerts.pem file in the lib folder or subfolder within the driver's installation directory. The exact file path varies depending on the version of the driver that is installed; for example, the path for the Windows driver is different from the path for the Mac OS X driver. No

Description

The location of the PEM file containing trusted CA certificates for authenticating the Sparkserver when using SSL.

If this option is not set, then the driver will default to using the trusted CA certificates PEMfile installed by the driver.

This option is applicable only to the following authentication mechanisms:

• User Name and Password (SSL)
• Windows Azure HDInsight Service
• HTTPS
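As an illustrative fragment, a custom CA certificate bundle can be supplied for an SSL-enabled mechanism like this; the host and file path are placeholders:

```
; Use a custom trusted CA certificate bundle instead of the bundled cacerts.pem
Host=spark.example.com;Port=10000;TrustedCerts=/etc/ssl/certs/my-cacerts.pem
```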

Unicode SQL Character Types

Key Name: UseUnicodeSqlCharacterTypes
Default Value: Clear (0)
Required: No

Description

When this option is enabled (1), the driver returns SQL_WVARCHAR for STRING and VARCHAR columns, and returns SQL_WCHAR for CHAR columns.

When this option is disabled (0), the driver returns SQL_VARCHAR for STRING and VARCHAR columns, and returns SQL_CHAR for CHAR columns.

Use Async Exec

Key Name: EnableAsyncExec
Default Value: Clear (0)
Required: No

Description

When this option is enabled (1), the driver uses an asynchronous version of the API call against Spark for executing a query.

When this option is disabled (0), the driver executes queries synchronously.

Use Native Query

Key Name: UseNativeQuery
Default Value: Clear (0)
Required: No

Description

When this option is enabled (1), the driver does not transform the queries emitted by an application, so the native query is used.

When this option is disabled (0), the driver transforms the queries emitted by an application and converts them into an equivalent form in Spark SQL.

If the application is Spark-aware and already emits Spark SQL, then enable this option to avoid the extra overhead of query transformation.
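For a Spark-aware application, the option can be enabled in the DSN. This fragment is illustrative and omits the other required keys:

```
[Spark DSN]
; Pass queries through unmodified; the application already emits Spark SQL
UseNativeQuery=1
```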

User Name

Key Name: UID
Default Value: For User Name authentication only, the default value is anonymous
Required: No, if the authentication mechanism is User Name. Yes, if the authentication mechanism is one of the following:

• User Name and Password
• User Name and Password (SSL)
• Windows Azure HDInsight Service
• HTTP
• HTTPS

Description

The user name that you use to access Shark Server 2 or Spark Thrift Server.
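As an illustration (not taken from the guide), a connection string for User Name and Password authentication can be assembled from the key names documented in this section. The host, port, and credentials are placeholders, and the AuthMech value is an assumption for this sketch; check the Authentication Mechanism option for the value your driver version expects.

```python
# Build an ODBC connection string for the Simba Spark driver.
# Host, port, credentials, and AuthMech below are placeholder values.
attrs = {
    "Driver": "Simba Spark ODBC Driver",
    "Host": "spark.example.com",   # placeholder host
    "Port": "10000",               # placeholder port
    "SparkServerType": "3",        # Spark Thrift Server
    "AuthMech": "3",               # assumed value for User Name and Password
    "UID": "myuser",               # placeholder user name
    "PWD": "mypassword",           # placeholder password
}

# ODBC connection strings are semicolon-separated Key=Value pairs.
connection_string = ";".join(f"{k}={v}" for k, v in attrs.items())
print(connection_string)
```

The same key names apply whether the values are supplied in a connection string or in a DSN definition.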

Configuration Options Having Only Key Names

The following configuration options do not appear in the Windows user interface for the Simba ODBC Driver with SQL Connector for Apache Spark, and are only accessible when using a connection string or configuring a connection from a Linux/Mac OS X computer:

• Driver on page 58
• SSP_ on page 59

Driver

Default Value: The default value varies depending on the version of the driver that is installed. For example, the value for the Windows driver is different from the value for the Mac OS X driver.
Required: Yes

Description

The name of the installed driver (Simba Spark ODBC Driver) or the absolute path of theSimba ODBC Driver with SQL Connector for Apache Spark shared object file.
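For example, on a Linux system the Driver key may point at the shared object file. The path below is a placeholder; the actual location varies by driver version and install location:

```
[Spark DSN]
; Either the installed driver name or an absolute path to the shared object
Driver=/opt/simba/sparkodbc/lib/64/libsparkodbc_sb64.so
```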

SSP_

Default Value: None
Required: No

Description

Set a server-side property by using the following syntax, where SSPKey is the name of the server-side property to set and SSPValue is the value to assign to it:

SSP_SSPKey=SSPValue

For example:

SSP_mapred.queue.names=myQueue

After the driver applies the server-side property, the SSP_ prefix is removed from the DSN entry, leaving an entry of SSPKey=SSPValue.

The SSP_ prefix must be upper case.
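The prefix-stripping behavior described above can be sketched in a few lines. This is an illustrative model of the documented behavior, not driver code, and the helper function name is invented:

```python
def apply_server_side_properties(dsn_entries):
    """Model of the documented SSP_ handling: entries whose keys start
    with the upper-case prefix 'SSP_' have the prefix removed before the
    property is applied; other entries are left untouched."""
    result = {}
    for key, value in dsn_entries.items():
        if key.startswith("SSP_"):
            result[key[len("SSP_"):]] = value  # strip the SSP_ prefix
        else:
            result[key] = value
    return result

entries = {"SSP_mapred.queue.names": "myQueue", "Host": "spark.example.com"}
print(apply_server_side_properties(entries))
# {'mapred.queue.names': 'myQueue', 'Host': 'spark.example.com'}
```

Note that a lower-case prefix such as ssp_ would not match, which is consistent with the requirement that the SSP_ prefix be upper case.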
