Big Data

Get Involved. Join the Conversation.


    Robin Chatterjee
    Best way to do BDA DR without dedicating a machine
    Topic posted August 21, 2018 by Robin Chatterjee, last edited September 25, 2018 by Arijit Chakraborty, tagged Big Data Appliance, Tip
    Hi, I would like to replicate Prod to DR while using DR as DEV.

    Is it possible to both replicate my production BDA to my DEV/DR BDA and also use the full DR BDA for DEV work? I don't want to split my DEV box into two clusters, so can I just replicate selected production HDFS directories using Cloudera Manager replication? What else might be needed for a full-fledged solution? I was looking for the cross-site BDA DR whitepaper, but it seems the only one published so far covers DR within the same data center.





    • Jean-Pierre Dijcks

      Hi Robin,

      Well, in general the answer is yes, this is technically possible. Whether it is advisable is a different matter.

      Here are some scenarios we have seen customers use:

      • Have a primary cluster (PRD) and a secondary (DR). Replicate from PRD to DR. Run analytics workloads, as well as tests, QA, and performance tests, on the DR system. This way you use DR for productive work (analytics) while keeping a DR copy. This assumes that the replication is timely enough for the analytics to be effective for the business.
      • Have an active-active setup with data in sync across the two clusters and run workloads across both as you see fit, essentially load balancing. Customers implement this using:
        • WANdisco, a partner solution that manages the synchronization
        • Dual ingest with ETL tools, pushing the data into both clusters; the ETL tools are "responsible" for keeping the clusters in sync
        • Kafka in front of the clusters, with both clusters subscribing to the topics and leveraging Kafka's guaranteed delivery to ensure sync (eventually)
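      The first option, scheduled replication, can be sketched with a distcp run (Cloudera Manager's BDR replication schedules are configured in the UI, but they run distcp jobs under the hood; the hostnames and paths below are placeholders, not from this thread):

```shell
# Mirror one selected production HDFS directory to the DR cluster.
# -update copies only new/changed files; -delete removes files on the
# target that no longer exist on the source, keeping the mirror exact.
hadoop distcp -update -delete \
    hdfs://prod-nn.example.com:8020/data/warehouse/sales \
    hdfs://dr-nn.example.com:8020/data/warehouse/sales
```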

      Now, one of the problems you will run into is that you will not have a cluster you can use to verify upgrades and test applications. In other words, when moving to the next CDH version on the BDA, how do you verify that everything works? If you use DR for that and it somehow breaks something, you are in a bit of a pickle.

      Regarding development, I assume you are developing analytics, pipelines, or the like. I think you can use DR for that. You will just need to make sure that none of that work EVER modifies the original data sets, which can be achieved by structuring the directories and using ACLs to prevent deletes, etc.
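      A hedged sketch of that ACL idea (the paths and the group name are hypothetical): give the dev group read-only access to the replicated data and a separate writable sandbox:

```shell
# Replicated production data: devs may read and traverse, but not write.
# The default ACL makes newly replicated subdirectories inherit the rule.
hdfs dfs -setfacl -R -m group:devs:r-x /data/prod_replica
hdfs dfs -setfacl -R -m default:group:devs:r-x /data/prod_replica

# Developers get their own writable area instead.
hdfs dfs -mkdir -p /data/dev_sandbox
hdfs dfs -chgrp -R devs /data/dev_sandbox
hdfs dfs -chmod -R 775 /data/dev_sandbox
```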

      All in all, as with all of these decisions, be mindful of saving money only to lose a lot of it by breaking something. Most of our production installations are really a three-system setup: Prod, DR, and Dev/Sandbox. Dev/Sandbox is used for new versions, new models, new apps, upgrades, betas, etc.; DR for analytics that are not super time-sensitive in data latency; and Prod for ingest, ETL, production apps, etc.

      Comments welcome!


      • Jean-Pierre Dijcks

        To add one more thought - I didn't want to dilute the above with a cloud pitch :-)

        If you are looking to DR the data (so not a workload failover), you could consider using something like Object Storage in Oracle Cloud. BDA comes with Big Data Manager (as of 4.12), which enables drag-and-drop file transfer to Object Storage. This generates a Spark application that parallelizes the data flow to Object Storage. Once the data is there, you can easily load it into Big Data Cloud Service or work on it with Autonomous Data Warehouse Cloud (as an example).

        Big Data Cloud Service could also be used as a dev environment, and you could also replicate data via BDR directly into BDCS...

        Some of this is in this OBE: Ingesting and backing up Data


    • Robin Chatterjee

      Thank you for all the options. In this case we already have two BDA machines, so the cloud option isn't viable.

    • Jean-Pierre Dijcks

      Makes sense. I would just chart out the requirements, and then design a solution on top of that with the two BDAs.

      One additional thought: you could consider the Big Data Lite VM as a functional development environment.



    • Viv

      Hi JP

      I have a similar requirement where I would like to create a sandbox env but am not looking to buy another BDA. We recently upgraded our BDA and had a few unpleasant surprises. I am working with Alexey, but I am looking for some guidance on how to create a mini BDA on commodity hardware for upgrade testing. Let me know your thoughts.

      BDA Lite does not seem to keep up with BDA upgrades.


      • Jean-Pierre Dijcks

        We are a little behind on BDLite, due to a switch to OL7 on the VM, etc.

        Would a cloud solution be an option? Imagine a short-lived cluster that is small, secure, and HA (i.e., like your small BDA) but in the cloud. Or is that not possible in your environment, and would it have to be on-prem?

        • Viv

          I am happy to look at a cloud solution as well. Let me know what you have in mind; Alexey F is fully across our requirements, and it would be good to know your thoughts. It would also be good to know what flavour that starter BDA in the cloud comes in commercially. Will it work under a UC model?

    • candan pehlivanoglu



      Actually, we are also planning to replicate our Prod env to our Sandbox env. Both are identically configured BDA appliances.
      However, our Prod BDA works with one Exadata machine and is connected with BDS 3.1. Whenever I switch the database to standby, I also need to be able to switch BDS to the other BDA appliance. Do you know a quicker way to do this than configuring BDS again? Or can I configure BDS against the standby database?


    • Thomas Luckenbach

      Hi Candan,

      I work for WANdisco, a solution provider mentioned earlier in this thread by JP. To cut to the chase, we have a product focused on your problem: Hadoop data replication. Unlike BDR and other point-in-time solutions, WANdisco Fusion (a BDA-optimized solution) provides continuous replication of selected HDFS directories (as well as Hive metadata) between two or more Hadoop clusters. It augments and complements data-ingestion solutions so that data created or modified on, say, cluster 1 (Prod) is duplicated in near real time to keep cluster 2 (Sandbox) in sync, even while a file is still being created on cluster 1. The point, of course, is that when you need to switch users to the alternate cluster, the data is ready to go. This can shift the architecture from a primary-to-passive-secondary backup model to a peering model that leverages all the assets, going back to the original scenario posed by Robin.

      I have attached the joint white paper, which describes the Maximum Availability Architecture for BDA using WANdisco Fusion.

      I hope this helps.


      • candan pehlivanoglu

        Hello Thomas;


        Actually, our problem is not replicating the data between BDAs. Our problem is that after switching the RDBMS database, we need to switch the BDA so we can use Big Data SQL without losing too much time. So we need to find a way to switch Big Data SQL to the Oracle standby nodes.

    • Marty Gubar

      Hi Candan -

      A database can be configured to connect to multiple Hadoop clusters; one of those clusters is designated as the default. As part of your external table definition, you can specify the cluster that the table will access. When this is not specified, the table uses the "default" cluster.

      In your case, create external tables without associating them with a cluster. When you need to switch the default, you will need to update a config file and a couple of database objects.
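      As an illustration (the table, column, and cluster names are made up, not from this thread), a Big Data SQL external table pins a cluster via the com.oracle.bigdata.cluster access parameter; omitting that parameter makes the table follow the default cluster, so it switches over when the default changes:

```sql
-- Pinned: always reads from cluster1, regardless of the default.
CREATE TABLE sales_pinned (order_id NUMBER, amount NUMBER)
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_HIVE
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS (
      com.oracle.bigdata.cluster=cluster1
      com.oracle.bigdata.tablename=default.sales
    )
  )
  REJECT LIMIT UNLIMITED;

-- Unpinned: no cluster parameter, so it follows bigdata.cluster.default.
CREATE TABLE sales (order_id NUMBER, amount NUMBER)
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_HIVE
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS (
      com.oracle.bigdata.tablename=default.sales
    )
  )
  REJECT LIMIT UNLIMITED;
```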

      1.  Update the default cluster in the configuration file 

      This file is within your $ORACLE_HOME/bigdatasql directory. The actual directory has changed between releases, so you can locate the file (typically named bigdata.properties) with:

      cd $ORACLE_HOME/bigdatasql
      find . -name bigdata.properties

      In that file, update the default cluster parameter:

      bigdata.cluster.default=<name of new default cluster>
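      For scripted switchovers, that edit can be automated with sed; a small sketch, demonstrated on a scratch copy of the file (the path and cluster names here are hypothetical; point CONF at the real file you located above):

```shell
# Set up a scratch copy of the config file with a starting default cluster.
CONF=/tmp/bigdata.properties
printf 'bigdata.cluster.default=prodcluster\n' > "$CONF"

# Flip the default cluster in place.
sed -i 's/^bigdata.cluster.default=.*/bigdata.cluster.default=drcluster/' "$CONF"

# Show the result.
grep '^bigdata.cluster.default=' "$CONF"
```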

      2.  Update database link to point to the new cluster

      Database links are created and used by extproc for connecting to the cluster. You will now update the database link for the default cluster. In SQL*Plus:

      SQL> select db_link,host from all_db_links;

      Find the database link for the cluster that will now be the default. This will be obvious from both the DB_LINK and HOST values. Drop and recreate the default database link, using the same names as before - it should be similar to:

      For versions later than 3.1:

      drop database link BDSQL$_DEFAULT_CLUSTER;
      create database link BDSQL$_DEFAULT_CLUSTER using '<copy the content from the select output for the new default cluster>';

      For 3.1, make sure the database link is public:

      drop public database link BDSQL$_DEFAULT_CLUSTER;
      create public database link BDSQL$_DEFAULT_CLUSTER using '<copy the content from the select output for the new default cluster>';


      Then, restart your extproc.

      Hope this helps!