We use cookies to give you the best experience on our website. If you continue to browse, then you agree to our privacy policy and cookie policy. Image for the cookie policy date
close icon

Read File into HDFS

Hi, I am running both Big Data Studio and Data Integration Platform on the same virtual machine.
I would like to create a DataFlow task that reads in an XLS file from C:\TEMP_FTP and insterts this file into HDFS on my localhost? Is there any documentation to achieve this?
Many thanks!

9 Replies

KS Karthikeyan SankaraVadivel Syncfusion Team August 23, 2017 03:52 AM UTC

Hi Tim, 

Thank you for contacting Syncfusion Support.  

In Data integration platform, we can able to read the files using the GetFile processor and insert those files in to HDFS using PutHDFS processor.  

We have prepared the workflow that met your requirement, please find the below link that contains the dataflow template and documentation for the dataflow.  


Please let us know if you need further assistance on this.  

Regards, 
Karthikeyan S  



TM Tim Maher August 23, 2017 09:31 AM UTC

Hello Karthikeyan,

This worked perfectly - many thanks!

If possible, I would like to add two more steps onto this.

Firstly, in the table I have imported into HDFS which is called Shifts , I have two columns called Shift Type and Total Cost.  Could I create a second table called 'Measures' by taking these to columns and creating  a second table with a corresponding column called Shift Type and three columns which take the Average, Median, Minimum and Maximum of Total Cost?

When this has been done I would then like to move the copy the Shifts and Measures tables into Hive on the same server which I can then read into my Analytics Platform - Qliksense.


I hope this is possible,

Best Wishes,

Tim



KS Karthikeyan SankaraVadivel Syncfusion Team August 24, 2017 11:46 AM UTC

Hi Tim, 

Yes this can be possible to create a hive table ‘Measures’ form another table ‘Shift’, using the Hive built-in functions we can calculate Mean, Median, Minimum, Maximum of the TotalCost columns.  

In Data Integration Platform, we have SelectHiveQL processor to execute the hive(HQL) SELECT query against a Hive database. Query result will be converted to Avro or CSV format. Then the result will be moved to the “Measures” table using PutHiveQL processor.  

We have prepared the sample workflow, please find the below link that contains the dataflow template and documentation for the dataflow.   


Please let me know if you need further assistance on this.  

Regards, 
Karthikeyan S  



TM Tim Maher August 24, 2017 12:25 PM UTC

Hi Karthikeyan,

Thanks for this solution.  Unfortunately, it is failing at the first step with an error :

.....Cannot create PoolableConnectionFactory (could not establish connection to jdbc:hive2://localhost:10000/default;auth=noSasl;:null). No Flowfileto route to failure:....

Can you assist please?

Many many thanks,


Tim



KS Karthikeyan SankaraVadivel Syncfusion Team August 24, 2017 12:53 PM UTC

Hi Tim, 

We suspect that the Hive Server is stopped in your machine. Please make sure that your Hive server is in running state. You can see the state of Hadoop ecosystem services in the Bigdata Studio Local Service manager.  

 

if not, Please Restart your services from the Service Manager.  
 

Once you have restarted the Hadoop ecosystem services, Please reinitiate your Data integration workflow.   

Please let me know if you have further assistance.  

Regards, 
Karthikeyan S 



TM Tim Maher August 24, 2017 01:42 PM UTC

Hi,

Thanks for your quick reply.  The services were running correctly. 

screenshot



 I have just thought that as part of another support thread I was advised to update the authentication parameters within the hive-site.xml file as I had to connect to the hive data from a separate analytics platform.  Would this have an effect ?  The parameters are now :


    hive.server2.authentication.syncfusion.username
    tim
 

 
    hive.server2.authentication.syncfusion.password
    password
 

Regards,

Tim


NK Nandhini K Syncfusion Team August 25, 2017 10:07 AM UTC

Hi Tim, 

Yes. You then need to provide credentials in the database connection URL. 
In the provided document you may find a section to set “Database Connection URL” for Hive like below, 
   
 

Where you can set authentication parameters as follows, 
Property 
Value 
Database Connection URL 
jdbc:hive2://localhost:10000/default; 
Database User 
tim 
Password 
password 

If the above configuration doesn’t work, please set authentication parameters in connection URL itself as below, 
Property 
Value 
Database Connection URL 
jdbc:hive2://localhost:10000/default; -n tim -p password 

Please let us know if you have any queries. 

Regards, 
Nandhini K 



TM Tim Maher August 25, 2017 04:32 PM UTC

Hi,


Thanks for your reply.  This is now working although it seems to fail on the last step :

HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:62 cannot recognize input near ','',' '"Total_Cost"' in value row constructor; it is possible that retrying the operation will succeed, so routing to retry: org.apache....................

I also have another question.  In the 'Insert in to Hive Table' step,  does the table that it is inserting into have to already exist ? or will it append to an existing table, overwrite an existing table or create a table is it doesn't exist?


Thanks you for your help



KS Karthikeyan SankaraVadivel Syncfusion Team August 28, 2017 09:55 AM UTC

Hi Tim, 

HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:62 cannot recognize input near ','',' '"Total_Cost"' in value row constructor; it is possible that retrying the operation will succeed, so routing to retry: org.apache.................... 
Make sure that your Insert query is properly constructed according to your Measures table. Use ‘ ‘ (single quotations)  for String column types in hive.  

For eg.   
Replacement Value in the Constructing queries processor  - insert into Measures values('${csv.2:getDelimitedField(1)}',${csv.2:getDelimitedField(2)},${csv.2:getDelimitedField(3)},${csv.2:getDelimitedField(4)},${csv.2:getDelimitedField(5)}) 

Query Generated - insert into Measures values('"Type4"',600.0,600.0,600.0,600.0)  

You can view the Query Generated from the “Constructing queries” processor using Data Provenance
I also have another question.  In the 'Insert in to Hive Table' step,  does the table that it is inserting into have to already exist ? or will it append to an existing table, overwrite an existing table or create a table is it doesn't exist? 
The “Insert in to Hive Table” processor just insert the records in to the Hive table. If the table Measures is not present in the Hive server, we need to create the table manually.  

If the table exists, It will not overwrite or create the table, It will just append the records in the existing table.  

Please let me know if you need further assistance.  

Regards, 
Karthikeyan S  


Loader.
Live Chat Icon For mobile
Up arrow icon