Nl HUG 2016 Feb Hadoop security from the trenches
Hadoop Security from the Trenches
Bolke de Bruin
Chief Wizard
This is not going to be a perfect talk
• It will be incomplete (squeezed for time)
• Probably without humor (I am just really bad at telling jokes)
• I have a disclaimer (I work for a bank)
• A lot of text in orange (ING is Oranje)
Agenda
• Security today in Hadoop
• Kerberos (in depth)
• Policy-based access
• Lineage (a bit)
• Encryption
Information security principles
Confidentiality
• Information is not made available or disclosed to unauthorized individuals, entities, or processes
Integrity
• Maintaining and assuring the accuracy and completeness of data over its entire lifecycle
Availability
• Data must be available when it is needed
Unfortunately most of the attention in Hadoop goes to confidentiality
Security today in Hadoop
Authentication: Who am I?
• Kerberos
• Apache Knox
Authorization: What can I do?
• Apache Ranger
• Apache Sentry
Audit: What did I do?
• Apache Ranger
• Cloudera Navigator
Data Protection: Can someone read my data?
• SSL
• SASL
• KMS
Data Governance: Where did my data come from and where is it going?
• Apache Atlas
• Cloudera Navigator
Identity Management
Taming Kerberos, the ferocious three-headed guard dog of Hadoop
Typical workflow
Kerberized workflow
[Diagram labels: use ticket; Hive gets service ticket; HDFS gets service ticket]
Kerberos has great advantages…
• Requires that each client, on each request, prove its identity
• Does not require a user to enter a password every time a service is accessed
• Works across operating systems
• Kerberos assumes that network connections, rather than servers and workstations, are the weak link in network security
• Did you know that Active Directory is just Kerberos+LDAP?
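To make the "no password on every request" point concrete, here is a minimal Python sketch of obtaining a ticket non-interactively from a keytab with kinit and then listing the ticket cache with klist. It assumes the MIT Kerberos client tools are on the PATH; the function names, keytab path, and principal are hypothetical placeholders, not something from the talk.

```python
import subprocess

def kinit_command(keytab, principal):
    """Build the argv for a non-interactive kinit using a keytab.

    -k tells kinit to use a keytab instead of prompting for a password,
    -t names the keytab file to read the key from.
    """
    return ["kinit", "-k", "-t", keytab, principal]

def login_from_keytab(keytab, principal):
    """Obtain a ticket-granting ticket from the keytab, then list the cache."""
    subprocess.run(kinit_command(keytab, principal), check=True)
    # klist prints the tickets currently held in the credential cache.
    result = subprocess.run(["klist"], check=True, capture_output=True, text=True)
    return result.stdout

# Hypothetical usage (path and principal are assumptions):
#   print(login_from_keytab("/etc/security/keytabs/etl.keytab", "etl@EXAMPLE.COM"))
```

Once the ticket is in the cache, kerberized clients started by the same user can pick it up without asking for a password again.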
…but its perceived complexity has stopped implementation
• AS, KDC, TGS, SS, TGT, KINIT, KEYTAB, KADMIN: so many abbreviations…
  • But you just need to remember a few: kinit, keytab, kdc
• Synchronization of host clocks required
  • Wait, what? You didn’t do that yet? Your local cloud provider already does this for you.
• Separate user databases if combined with LDAP or PAM
  • Well, there is Active Directory and there is FreeIPA
• Tool Xxx is not kerberized and I really need it
  • Insecure: don’t use it, or add patches yourself. Yeah, open source!
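The clock synchronization requirement exists because Kerberos rejects requests whose timestamps differ too much from the server's clock. The sketch below illustrates that kind of check in Python. The 300-second tolerance matches the MIT Kerberos default for the clockskew setting; the function name is hypothetical and this is an illustration, not the actual KDC logic.

```python
from datetime import datetime, timedelta, timezone

# MIT Kerberos defaults to a 300-second tolerance (the clockskew setting).
DEFAULT_MAX_SKEW = timedelta(seconds=300)

def within_clock_skew(client_time, server_time, max_skew=DEFAULT_MAX_SKEW):
    """Return True if the two clocks are close enough for a Kerberos-style check."""
    return abs(client_time - server_time) <= max_skew

server = datetime(2016, 2, 1, 12, 0, 0, tzinfo=timezone.utc)
print(within_clock_skew(server + timedelta(minutes=2), server))   # True
print(within_clock_skew(server + timedelta(minutes=10), server))  # False
```

A host that drifts by more than the tolerance will see authentication failures even though its credentials are valid, which is why NTP (or the cloud provider's equivalent) matters.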
Ehh FreeIPA?
Looks familiar doesn’t it? Oh yes this is Active Directory!
Integration in an Enterprise environment
• Fully integrated with Operating System and Hadoop
• UserIDs are the same, shared and immediate
• Can use PAM
• YARN and HDFS ACLs start working out of the box, as local users just exist
That is the big stuff!
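The reason this works is that on an enrolled host, directory users resolve through the normal operating-system lookup (NSS) as if they were local accounts. A minimal Python sketch of that lookup follows. `pwd.getpwnam` goes through the C library's user lookup, so it sees whatever sources the host's NSS configuration provides, not only /etc/passwd. The helper name and the user name in the comment are hypothetical.

```python
import pwd

def os_user_exists(username):
    """Return True if the operating system can resolve the user name."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        # getpwnam raises KeyError when no NSS source knows the user.
        return False

# Example: os_user_exists("alice") is True on a host where "alice" resolves
# through any configured NSS source (local files, SSSD, ...).
```

Software that reads /etc/passwd directly instead of asking the operating system will not see such users.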
Installing is difficult right?
• Server
  # yum -y install ipa-server
  # ipa-server-install
• Client
  # yum -y install ipa-client
  # ipa-client-install
Support in Hadoop distributions is slightly lagging
Quite easy actually: gen_credentials.sh just needs to be adjusted:
http://blog.godatadriven.com/samba-configuration.html (for IPA it needs to be adjusted)
https://github.com/HariSekhon/tools/blob/master/ambari_freeipa_kerberos_setup.pl
Written by an ex-Cloudera guy ;-)
Caveats
• Trusted domains deliver users as “username@REALM”; Hadoop and Hive filter on ‘@’
  • See: https://issues.apache.org/jira/browse/HADOOP-12751
  • See: https://issues.apache.org/jira/browse/HIVE-12981
• Workaround: convert @ to _ by means of sssd
  • full_name_format = %1$s_%2$s
  • re_expression = (((?P<Name>[^@]+)_(?P<Domain>.+$))|((?P<Domain>[^\\]+)\\(?P<Name>.+$))|((?P<Name>[^@]+)@(?P<Domain>.+$))|(^(?P<Name>[^@\\]+)$))
• Or just wait for the patches to land
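To show what the sssd workaround above is meant to achieve, here is a rough Python sketch of the same mapping: accept a login in any of the forms covered by the re_expression and emit it as name_domain, as the full_name_format does. Python's `re` module does not allow the same group name twice in one pattern, so the alternation is split into separate patterns tried in order. The function names are hypothetical, and this only approximates the behaviour; it is not how sssd evaluates the expression internally.

```python
import re

# One pattern per alternative of the sssd re_expression, in the same order.
_LOGIN_PATTERNS = [
    re.compile(r"^(?P<name>[^@]+)_(?P<domain>.+)$"),     # name_domain
    re.compile(r"^(?P<domain>[^\\]+)\\(?P<name>.+)$"),   # DOMAIN\name
    re.compile(r"^(?P<name>[^@]+)@(?P<domain>.+)$"),     # name@domain
    re.compile(r"^(?P<name>[^@\\]+)$"),                  # bare name
]

def split_login(login):
    """Split a login into (name, domain); domain is None for a bare name."""
    for pattern in _LOGIN_PATTERNS:
        match = pattern.match(login)
        if match:
            groups = match.groupdict()
            return groups["name"], groups.get("domain")
    raise ValueError(f"unrecognized login format: {login!r}")

def to_hadoop_name(login):
    """Render the login the way full_name_format = %1$s_%2$s would."""
    name, domain = split_login(login)
    return f"{name}_{domain}" if domain else name

print(to_hadoop_name("alice@EXAMPLE.COM"))   # alice_EXAMPLE.COM
print(to_hadoop_name("EXAMPLE\\alice"))      # alice_EXAMPLE
print(to_hadoop_name("alice"))               # alice
```

The resulting names contain no ‘@’, so the filtering in Hadoop and Hive described above no longer applies to them.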
Data access policies and auditing with Ranger
How are policies applied?
Where is Spark?
Active policies
Caveats
• Ranger (but also Sentry) feels like slapped-on security: just usable, but barely
• User synchronization can be very slow with many users due to architecture issues
• Unix synchronization and authentication use /etc/passwd and /etc/group instead of NSS and PAM
  • https://issues.apache.org/jira/browse/RANGER-842
  • https://issues.apache.org/jira/browse/RANGER-827
  • If these patches land, syncing will be much faster for IPA/SSSD-enabled systems
• No real Spark roadmap, just spark-sql. This also goes for Sentry
• Doesn’t manage HDFS ACLs and requires Hive user access… defeating end-to-end security
Data Governance
• Why?
  • We need to be able to pinpoint what data resides where, why, and what happened to it.
• Why?
  • Because you might want us to remove your data
  • … and the regulator says so
Encryption
• Data at rest
  • Used if you don’t trust your physical infrastructure. Cloud!
  • Only our highest confidentiality levels require it; we are not at that level, so we don’t use it
• Data in transit
  • Data across untrusted networks. Cloud?
  • Perimeter security solves a lot of these issues; you take a significant performance hit of around 20% if you enable it within your cluster
  • For ETL or data ingestion it becomes more reasonable
  • For us it is enabled for access TO the cluster, NOT WITHIN
• Data democratization
  • Use case: allow some data scientists to see the original data and others only the masked/anonymized data
  • We are tinkering with this
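As a rough illustration of the data democratization use case, the Python sketch below returns either the original record or a masked copy depending on who is asking. Everything in it is an assumption made for illustration: the function names, the salt, the sample record, and the choice of a truncated salted SHA-256 as the masking method. It is not the approach described in the talk, and a truncated hash is pseudonymization rather than strong anonymization.

```python
import hashlib

def mask_value(value, salt):
    """Replace a value with a salted SHA-256 pseudonym (first 12 hex characters)."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]

def view_record(record, sensitive_fields, may_see_original, salt="demo-salt"):
    """Return the record as a given data scientist would see it."""
    if may_see_original:
        return dict(record)
    return {
        key: mask_value(val, salt) if key in sensitive_fields else val
        for key, val in record.items()
    }

# Made-up sample record for illustration only.
record = {"customer": "J. Jansen", "iban": "NL00BANK0123456789", "balance": 1250}
print(view_record(record, {"customer", "iban"}, may_see_original=True))
print(view_record(record, {"customer", "iban"}, may_see_original=False))
```

Because the masking is deterministic for a given salt, masked values can still be used to join or group records without revealing the original value.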
An example architecture
We are hiring! [email protected]
Frank Derks, John Muller, Pooja Rao, Hylke Hendriksen
Giovanni Lanzini, Fabian Jansen, Hanneke van Veldhuizen, Johan Witman
Wendell Kuling, Jonas Ahrendt, Bolke de Bruin, Ivo Everts
Doron Reuter, Zhe Sun