Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

etl - How to join two CSVs with Apache Nifi

I'm looking into ETL tools (like Talend) and investigating whether Apache NiFi could be used. Could NiFi perform the following:

  1. Pick up two CSV files that are placed on local disk
  2. Join the CSVs on a common column
  3. Write the joined CSV to disk

I've tried setting up a job in NiFi, but couldn't see how to join two separate CSV files. Is this task possible in Apache NiFi?

It looks like the QueryDNS processor could be used to enrich one CSV file using the other, but that seems overcomplicated for this use case.

Here's an example of the input CSVs, which need to be joined on state_id:

Input files

customers.csv

id | name | address      | state_id
---|------|--------------|---------
1  | John | 10 Blue Lane | 100
2  | Bob  | 15 Green St. | 200

states.csv

state_id | state
---------|---------
100      | Alabama
200      | New York

Output file

output.csv

id | name | address      | state
---|------|--------------|---------
1  | John | 10 Blue Lane | Alabama
2  | Bob  | 15 Green St. | New York

1 Answer

Apache NiFi is primarily a dataflow tool and is not really designed to perform arbitrary joins of streaming data. Those kinds of operations are typically better suited to stream processing systems such as Storm, Flink, or Apex, or to ETL tools.

The kind of join NiFi handles well is an enrichment lookup: there is a fixed-size lookup dataset, and for each record in the incoming data you use that dataset to retrieve some value. For example, in your case there could be a processor called LookUpState with a property "State Data" that points to a file containing all the states; customers.csv would then be the input to this processor.
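
To make the enrichment-lookup idea concrete, here is a minimal sketch in plain Python. It is not NiFi code and not how any particular processor is implemented. The function names `load_lookup` and `enrich` are hypothetical, and the sample data is inlined as comma-separated text (an assumption, based on the tables in the question) so that the snippet runs on its own.

```python
import csv
import io

# Sample data from the question, inlined as comma-separated text (assumed format).
CUSTOMERS_CSV = """id,name,address,state_id
1,John,10 Blue Lane,100
2,Bob,15 Green St.,200
"""

STATES_CSV = """state_id,state
100,Alabama
200,New York
"""


def load_lookup(text, key, value):
    """Read the fixed-size lookup dataset into a dict: key column -> value column."""
    return {row[key]: row[value] for row in csv.DictReader(io.StringIO(text))}


def enrich(records_text, lookup, key="state_id", new_field="state"):
    """Replace the key column of each incoming record with its looked-up value."""
    reader = csv.DictReader(io.StringIO(records_text))
    out_fields = [f for f in reader.fieldnames if f != key] + [new_field]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=out_fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Records with no match get an empty value rather than being dropped.
        row[new_field] = lookup.get(row.pop(key), "")
        writer.writerow(row)
    return out.getvalue()


state_lookup = load_lookup(STATES_CSV, "state_id", "state")
joined = enrich(CUSTOMERS_CSV, state_lookup)
print(joined)
```

The lookup table is loaded once and held in memory, while the customer records are processed one at a time. That is why this pattern fits a dataflow tool better than an arbitrary join of two unbounded streams.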

A community member started a project to make a generic lookup service for NiFi: https://github.com/jfrazee/nifi-lookup-service
