Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
468 views
in Technique[技术] by (71.8m points)

csv - Create report and upload to server for download

I have a CSV file thats 300MB and has over 1/2 million entries. I want to run reports and make these reports available for download.

Here's the data structure:

A;B;C;D;E;F;G

What I'd like to do is create a separate file that contains the top 50 rows when the file has been sorted by column B.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I'd recommend using Solr to process your data, it supports the indexing of CSV data.

Solr solves the problems associated with uploading, downloading and searching the data. It's performance is such that the index can be easily updated or built again from scratch.

Example

Install Solr

Read the documentation for how to install Solr. The following is a "kick-start" to make this demo work on Linux:

wget http://www.apache.org/dist/lucene/solr/3.5.0/apache-solr-3.5.0.tgz
tar zxvf apache-solr-3.5.0.tgz
cd apache-solr-3.5.0/example
java -jar start.jar

Solr admin screen available on the following URL

http://localhost:8983/solr/admin/

Generate sample data

Wrote a groovy script to generate a sample CSV file.

new File("data.csv").withWriter { writer ->
    writer.println "id;A_s;B_i;C_s;D_s;E_s;F_s"

    for (i in 1..500000) {
        writer.println "${i};${i*10};${i*20};${i*30};${i*40};${i*50};${i*60}"
    }
}

Example

id;A_s;B_i;C_s;D_s;E_s;F_s
1;10;20;30;40;50;60
2;20;40;60;80;100;120
3;30;60;90;120;150;180
4;40;80;120;160;200;240
5;50;100;150;200;250;300
6;60;120;180;240;300;360
7;70;140;210;280;350;420
8;80;160;240;320;400;480
9;90;180;270;360;450;540
..

Note:

I performed no customisation to the "out of the box" Solr settings. This meant

  • Default Solr configuration requires a unique id field
  • Appending "_s" to the column name indicates string fields. See dynamic fields feature in Solr
  • Loaded Column B as an integer (B_i) to enable numeric sorting

See the Solr Wiki for details on how to write a custom schema.

Load CSV file

The Linux curl command is used to load the data.csv file, by performing a HTTP post operation:

$ curl 'http://localhost:8983/solr/update/csv?separator=;&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @data.csv

Note:

  • 500,000 rows of data loaded in approx 90 seconds
  • Use separator parameter to indicate the data is not separated by the default ","
  • Use commit parameter to indicate data is committed to index at the end.

Query the data

The following query returns the top 50 rows sorted on column B

http://localhost:8983/solr/select/?q=*:*&rows=50&sort=B_i+desc&fl=id,A_s,B_i,C_s,D_s,E_s,F_s&wt=csv

Output is CSV formatted:

id,A_s,B_i,C_s,D_s,E_s,F_s
500000,5000000,10000000,15000000,20000000,25000000,30000000
499999,4999990,9999980,14999970,19999960,24999950,29999940
499998,4999980,9999960,14999940,19999920,24999900,29999880
499997,4999970,9999940,14999910,19999880,24999850,29999820
499996,4999960,9999920,14999880,19999840,24999800,29999760
..

REST parameters:

  |-------------------------------|------------------------------|
  | Rest parameter                | Description                  |                      
  |-------------------------------|------------------------------|
  | q=*:*                         | Everything.                  |
  | rows=50                       | Number of rows to output     |
  | sort=B_i+desc                 | descending order, sort on B_i|
  | fl=id,A_s,B_i,C_s,D_s,E_s,F_s | columns to include           |
  | wt=csv                        | CSV output                   |
  |-------------------------------|------------------------------|

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...